This thesis addresses problems from two areas of theoretical computer
science. The first area is that of computational learning theory, which
is the study of the phenomenon of concept learning using formal
mathematical models. The goal of computational learning theory is to
investigate learning in a rigorous manner through the use of techniques
from theoretical computer science. Much of the work in this field is in
the context of the PAC model of learning, in which learning is carried
out in a probabilistic environment. Of particular interest are the
questions of determining for which classes of concepts the
PAC-learning problem is tractable and discovering efficient learning
algorithms for such classes. The second area from which topics are
drawn is that of online algorithms for graph-theoretic problems. Many
problems in such fields as communications, transportation,
scheduling, and networking can be reduced to that of finding a good
graph algorithm. After an introduction in Chapter 1, some background
information is provided in Chapter 2 on the field of computational
learning theory. In Chapter 3 it is shown that for any concept class
having a particular closure property, the PAC-learnability of the
class implies the existence of an Occam algorithm. Chapter 4 defines
a variation on the standard PAC model of learning called
semi-supervised learning, a model which permits the rigorous study of
learning situations where the teacher plays only a limited role.
Chapter 5 deals with the problem of prediction as performed by
deterministic finite automata, counter machines, and deterministic
pushdown automata. Chapter 6 investigates the power and the
performance of online algorithms for a certain class of graph
problems, referred to as vertex labeling problems. (77 references)
ABSTRACT



The distribution-independent model of concept learning from examples ("PAC-learning") 
due to Valiant is investigated. It has previously been shown that the existence of an Occam
algorithm for a class of concepts is a sufficient condition for the PAC-learnability of that class. 
(An Occam algorithm is a randomized polynomial-time algorithm that, when given as input 
a sample of strings of some unknown concept to be learned, outputs a small description of a 
concept that is consistent with the sample.) It is shown here that for any class satisfying the 
property of closure under exception lists, the PAC-learnability of the class implies the existence 
of an Occam algorithm for the class. Thus the existence of randomized Occam algorithms 
exactly characterizes PAC-learnability for all concept classes with this property. This reveals a 
close relationship between PAC-learning and information compression for a wide range
of interesting classes. 

The PAC-learning model is then extended to that of semi-supervised learning (ss-learning), 
in which a collection of disjoint concepts is to be simultaneously learned with only partial 
information concerning concept membership available to the learning algorithm. It is shown that 
many PAC-learnable concept classes are also ss-learnable. Several sets of sufficient conditions 
for a class to be ss-learnable are given. A prediction-based definition of learning multiple 
concept classes has been given and shown to be equivalent to ss-learning. 

The predictive ability of automata less powerful than Turing machines is investigated. Models
for prediction by deterministic finite state machines, 1-counter machines, and deterministic
pushdown automata are defined, and the classes of languages that can be predicted by these 
types of automata are precisely characterized. In particular, these varieties of automata can 
predict exactly the finite classes of regular languages, the finite classes of 1-counter languages, 
and the finite classes of deterministic context-free languages, respectively. In addition, upper 
bounds are given for the size of classes that can be predicted by such automata. 

Two new online protocols for graph algorithms are defined. Bounds on the performance of 
online algorithms for the graph bandwidth, vertex cover, independent set, and dominating set 
problems are demonstrated. Various results are proved for algorithms operating according to a 
standard online protocol as well as the two new protocols. 
1 INTRODUCTION



This thesis addresses problems from two areas of theoretical computer science. The first area is 
that of computational learning theory, which is the study of the phenomenon of concept learning 
using formal mathematical models. Learning is a topic of considerable interest to researchers 
in cognitive science and artificial intelligence; the goal of computational learning theory is
to investigate learning in a rigorous manner through the use of techniques from theoretical 
computer science. Much of the work in this field is in the context of the PAC (an acronym 
for "probably approximately correct") model of learning, in which learning is carried out in 
a probabilistic environment. Of particular interest are the questions of determining for which 
classes of concepts the PAC-learning problem is tractable and discovering efficient learning 
algorithms for such classes. 

The second area from which topics are drawn is that of online algorithms for graph-theoretic
problems. Graphs are used to represent a wide variety of problems in such fields as commu- 
nications, transportation, scheduling, and network analysis. Many problems in science and 
engineering can be reduced to that of finding a good graph algorithm. An online algorithm 
is one that receives its input in discrete stages, and at each stage must produce an output 
based only on the information it has seen thus far. Online algorithms model, in a limited sense, 
"real-time" computation, since they must react to their environment as it is being presented to 
them. In addition, online algorithms can be used to study how well algorithms are able to per- 
form with only partial information about the problem instance, and to what extent additional 
computational resources can compensate for incomplete information. 

1.1 Overview 

The material in this thesis is organized as follows.

Chapter 2 gives some background information on the field of computational learning theory 
in general, and the PAC model of learning in particular. Notation and terminology that will 
be used in the rest of the dissertation are defined. 
In Chapter 3 it is shown that for any concept class having a particular closure property,
the PAC-learnability of the class implies the existence of an Occam algorithm. Separate results
are proved for the cases when the alphabet used to describe concepts is finite and infinite.
Combining these with two theorems of Blumer, Ehrenfeucht, Haussler and Warmuth [12, 11]
yields the result that, for a wide range of interesting concept classes, the existence of an Occam 
algorithm is equivalent to PAC-learnability. This chapter is based on joint work with Leonard 
Pitt [13]. 

In the next chapter a variation on the standard PAC model of learning, called
semi-supervised learning (ss-learning), is defined. This new model permits the rigorous study of
learning situations in which the teacher plays only a very limited role. We prove that a num- 
ber of interesting PAC-learnable concept classes are also ss-learnable, and give several sets of 
sufficient conditions for a class to be ss-learnable. This chapter is also based on joint work with 
Leonard Pitt [14]. 

Chapter 5 deals with the problem of prediction as performed by automata with less power 
than Turing machines. We define models of prediction in which the prediction is performed by 
deterministic finite automata, counter machines, and deterministic pushdown automata. For 
each of these models we give a precise characterization of the language classes that can be
predicted. 

In Chapter 6 we investigate the power of online algorithms for a certain class of graph prob- 
lems, referred to as vertex labeling problems. In addition to a standard online protocol, two new 
online protocols are defined for these problems. We then prove bounds on the performance of 
online algorithms operating according to these protocols for the graph bandwidth, independent 
set, vertex cover, and dominating set problems. 

The final chapter presents a brief summary of results. 
2 THE PAC MODEL OF LEARNING



A general model of computational learning can be described as follows. The domain is a set, 
such as the set of points in n-dimensional Euclidean space or the set of all binary strings. A
concept is a subset of the domain, and a concept class is a set of concepts. Associated with each
concept is a description of the concept, called its representation. An example of a concept c is an 
element of the domain, together with a bit that indicates whether or not that element is in the 
concept c. A learning algorithm, or learner, for a concept class C is an algorithm that accepts 
as input examples of some target concept $c \in C$ (and possibly some additional information) and
outputs a hypothesis, which is the algorithm's "guess" as to what c is. Depending on the exact 
model being used, there may be restrictions as to how accurate the hypothesis must be, how 
much time the algorithm is allowed, how the examples are chosen, etc., in order for the algorithm 
to be considered a learning algorithm for the class C. A concept class is learnable if there exists
a learning algorithm for it. There are a number of different models of computational learning; 
these models vary considerably, but almost all share the characteristics just described. One 
model that has received considerable attention in the literature recently is the PAC ("probably
approximately correct") model. 

The PAC model of learning was introduced by Valiant in [73]. It has been widely used
to investigate the learnability of concept classes in several domains (see, for example, papers 
in [39] and [67]). Much of its appeal is due to the fact that, rather than requiring the learning 
algorithm to always be exactly right, it is sufficient for the algorithm to almost always find a
hypothesis that is highly, although perhaps not precisely, accurate. By permitting the learner 
this leeway, the model allows some interesting concept classes to be learned in polynomial time. 
In addition, requiring only approximate correctness is intuitively appealing, since it seems more 
closely related to human learning than does requiring exact correctness. 

In the PAC model, the learning algorithm is given an amount of time polynomial in the
length of the representation of the concept to be learned and the length of the examples that
are presented. The model assumes that the examples of the unknown concept that the learning
algorithm receives have been selected randomly according to some fixed but arbitrary and
unknown probability distribution over examples of some maximum length n. The algorithm
must, for any such distribution, output a hypothesis that, with high probability, will have a low
distribution-weighted error relative to the unknown concept.

The following set notation is used throughout this thesis, not just in the chapters on
computational learning. If $S$ and $T$ are sets, then $S \subseteq T$ and $S \subset T$ denote that $S$ is a subset
and proper subset, respectively, of $T$. $S \cup T$ represents the union of $S$ and $T$, and $S \cap T$ their
intersection. The symbol $\in$ indicates set containment, so $x \in S$ means that $x$ is an element of
$S$. $S - T$ denotes the set of elements in $S$ that are not in $T$, and the symmetric difference of $S$
and $T$ is written as $S \oplus T = (S - T) \cup (T - S)$. $|S|$ is the cardinality of the set $S$. The sets of
real and natural numbers are represented by $\mathbb{R}$ and $\mathbb{N}$, respectively. The empty set is denoted
by $\emptyset$.

If $\Sigma$ is a (not necessarily finite) alphabet, then $\Sigma^*$ denotes the set of all finite-length strings
of elements of $\Sigma$. If $w \in \Sigma^*$, then the length of $w$, denoted $|w|$, is the number of symbols in the
string $w$. Let $\Sigma^{[n]}$ denote the set $\{w \in \Sigma^* : |w| \le n\}$. All logarithms are in base 2.

Define a concept class to be a pair $\mathbf{C} = (C, X)$, where $X$ is a set and $C \subseteq 2^X$. $X$ is the
domain of $\mathbf{C}$ and the elements of $C$ are concepts. $X$ can be thought of as a universe of objects,
and each concept in $C$ as the set of objects with certain properties. We are interested in the
problem of determining which concept classes are learnable; that is, the problem of deciding 
which concept classes have learning algorithms. 

Since learning algorithms must be able to output their hypotheses, there must be some 
means of representing the concepts in C concisely (there is no requirement that the concepts 
be finite, so clearly representing a concept extensionally is not feasible). Thus we must define, 
in addition to the concept classes, some means of representing the concepts. 

Let the domain $X$ be a set of strings in $\Sigma^*$, for some alphabet $\Sigma$. We describe a context for
representing concepts over X. 

Following [4, 75], define a class of representations to be a quadruple $\mathbf{R} = (R, \Gamma, c, \Sigma)$. $\Sigma$ and
$\Gamma$ are sets of characters. Strings composed of characters in $\Sigma$ are used to describe elements of
$X$, and strings of characters in $\Gamma$ are used to describe concepts. $R \subseteq \Gamma^*$ is the set of strings
that are concept descriptions or representations. Let $c : R \to 2^{\Sigma^*}$ be a function that maps
these representations into concepts over $\Sigma^*$. $R$ may be thought of as a collection of names of
concepts, and for any $r \in R$, $c(r)$ is the concept named by $r$.
For example, we might represent the concept class consisting of all regular binary languages
as follows. Let $\Sigma = \{0, 1\}$ and define $R$ to be the set of all deterministic finite state automata
(DFAs) over the binary alphabet. $\Gamma$ is the set of characters needed to encode DFAs under some
reasonable encoding scheme, and $c$ maps DFAs into the regular languages that they accept.

As another example, suppose we wish to represent the concept class of Boolean formulas
over the variables $x_1, \ldots, x_n$. (That is, each concept is the set of n-bit binary strings that
correspond to satisfying assignments of some particular Boolean formula over n variables.) One
possible class of representations would be to let $\Sigma = \{0, 1\}$, $\Gamma = \{x_1, \ldots, x_n, \wedge, \vee, \neg, (, )\}$, $R$
be the set of all well-formed Boolean formulas over $x_1, x_2, \ldots, x_n$ (written using the characters
in $\Gamma$), and $c$ map each formula in $R$ to the set of its satisfying assignments.

To represent concepts over the real numbers, $\Sigma$ can be defined so that each of its elements
corresponds to a different real number. Since it is likely that concept descriptions would also
need to make reference to real numbers, $\Gamma$ could also include all of the reals, and thus both $\Sigma$
and $\Gamma$ would be uncountable.

For any $\sigma \in \Sigma \cup \Gamma$, $|\sigma|$ is defined to be 1, and thus if $\Sigma$ or $\Gamma$ is an uncountable alphabet,
such as the real numbers, then each number counts as one "unit", and we assume for clarity 
of exposition that elementary operations are executable in one unit of time. Our results also 
hold when the logarithmic cost model is considered, wherein elements are represented to some 
finite level of accuracy, and thus require space equal to the number of bits of precision. In 
this scheme, an elementary algorithmic operation on an element takes time proportional to the 
number of bits of precision. 

Note that if $\mathbf{R} = (R, \Gamma, c, \Sigma)$ is a class of representations then there is an associated concept
class $C(\mathbf{R}) = (c(R), \Sigma^*)$, where $c(R) = \{c(r) : r \in R\}$. Since the PAC-learnability of a class
of concepts may depend on the choice of representations [61], PAC-learnability is in fact a 
property of classes of representations rather than of concept classes. 

For convenience, we write $r(x) = 1$ if $x \in c(r)$, and $r(x) = 0$ otherwise. We also write
$r$ in place of $c(r)$ when the meaning is clear from context. Thus sometimes $r$ denotes the
representation of a concept, and sometimes it denotes the concept itself. However, whenever
we refer to the size of $r$, denoted $|r|$, the length of the representation is always intended, and
not the cardinality of the concept. An example of $r$ is a pair $(x, r(x))$, where $r(x)$ is the label
of $x$. If $r(x) = 1$ then $(x, r(x))$ is a positive example; if $r(x) = 0$ then it is a negative example.
The length of an example $(x, r(x))$ is $|x|$. A sample of size $m$ of the concept $r$ is a multiset of
$m$ examples of $r$. We let $R^{[s]}$ denote the set $\{r \in R : |r| \le s\}$.

For a class of representations, the membership problem is that of determining, given $r \in R$
and $x \in \Sigma^*$, whether or not $x \in c(r)$. We consider only classes of representations for which the
membership problem is decidable in polynomial time; classes without this property would be
of little use in a practical setting. Thus we only consider representation classes $\mathbf{R} = (R, \Gamma, c, \Sigma)$
for which there exists a polynomial-time algorithm EVAL such that for all $r \in R$ and $x \in \Sigma^*$,
EVAL($r$, $x$) $= r(x)$. EVAL runs in time polynomial in $|r|$ and $|x|$. Such an algorithm is
a uniform polynomial-time evaluation procedure; "uniform" refers to the fact that there is a
single algorithm that can test membership for any concept in the class.
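
As a minimal sketch of a uniform evaluation procedure, the following Python function plays the role of EVAL for the DFA representation class used as an example earlier in this chapter. The dictionary encoding of a DFA (keys `start`, `accepting`, `delta`) and the names `eval_dfa` and `EVEN_ONES` are assumptions made for illustration; they are not taken from the thesis.

```python
def eval_dfa(r, x):
    """Return r(x): 1 if the DFA encoded by r accepts the string x, else 0.

    Hypothetical encoding: r["start"] is the start state, r["accepting"] is a
    set of states, and r["delta"] maps (state, symbol) pairs to states.
    """
    state = r["start"]
    for symbol in x:
        state = r["delta"][(state, symbol)]  # one table lookup per input symbol
    return 1 if state in r["accepting"] else 0


# Example representation: binary strings containing an even number of 1s.
EVEN_ONES = {
    "start": "even",
    "accepting": {"even"},
    "delta": {
        ("even", "0"): "even", ("even", "1"): "odd",
        ("odd", "0"): "odd", ("odd", "1"): "even",
    },
}
```

The same function evaluates every DFA given in this encoding, which is the sense of "uniform", and it makes one transition per input symbol, so its running time is polynomial in the size of the representation and the length of the string.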

A randomized algorithm is an algorithm that behaves like a deterministic one with the
additional property that, at one or more steps during its execution, the algorithm can flip a fair
two-sided coin and use the result of the coin flip in its ensuing computation.$^1$ In this thesis we
make assertions of the form that there exist randomized algorithms that, when given as input a
parameter $\gamma > 0$, will satisfy certain requirements with probability at least $1 - \gamma$. Without loss
of generality we allow such randomized algorithms to choose one of $m \ge 2$ outcomes with equal
probability. Such a choice may be simulated in time polynomial in $m$ and $\frac{1}{\gamma}$ by a two-sided
coin with a small additional error that can be absorbed into $\gamma$.$^2$ With this understanding we
ignore this additional error in the arguments to follow.
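
The simulation of an m-way choice by a two-sided coin, described in the accompanying footnote, can be sketched in Python as follows. The function name `choose_one_of` and the injectable `flip` parameter are hypothetical, and the sketch assumes $0 < \gamma < 1$ and $m \ge 2$.

```python
import math
import random


def choose_one_of(m, gamma, flip=lambda: random.getrandbits(1)):
    """Pick a value in 1..m using only fair two-sided coin flips.

    Returns None if every round is rejected, which happens with
    probability at most gamma / 2.
    """
    bits = (m - 1).bit_length()                # ceil(log2(m)) for m >= 2
    rounds = math.ceil(1 - math.log2(gamma))   # repetition bound from the footnote
    for _ in range(rounds):
        value = 0
        for _ in range(bits):
            value = (value << 1) | flip()      # read the flips as a binary number
        if value < m:                          # value + 1 lies between 1 and m
            return value + 1
    return None
```

Each round is rejected with probability less than one half, so the chance that all rounds fail is at most $2^{-(1 - \log_2 \gamma)} = \gamma/2$.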

If $\mathbf{R} = (R, \Gamma, c, \Sigma)$ is a class of representations, $r \in R$, and $D$ is a probability distribution
on $\Sigma^*$, then EXAMPLE($D$, $r$) is an oracle that, when called, randomly chooses an $x \in \Sigma^*$
according to distribution $D$, and returns the pair $(x, r(x))$.
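
A rough Python sketch of such an oracle is given below. It assumes, purely for illustration, that the distribution has finite support and is passed as a list of strings with matching weights; the name `make_example_oracle` and its parameters are hypothetical.

```python
import random


def make_example_oracle(support, weights, r, seed=None):
    """Return a zero-argument function that behaves like EXAMPLE(D, r).

    D is assumed to have finite support: support[i] is drawn with probability
    proportional to weights[i]. r maps a string to its label, 0 or 1.
    """
    rng = random.Random(seed)

    def example():
        x = rng.choices(support, weights=weights, k=1)[0]  # draw x according to D
        return x, r(x)                                     # labeled example (x, r(x))

    return example
```

A learner written against this interface sees only the labeled pairs and never inspects the distribution or the target representation directly.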

The following definition of learnability (and minor variants thereof) appears widely in the 
literature of computational learning theory. (See, for example, [39, 67]; the essence of the 

x See ,'~Q] for a formal treatment. 

$^2$ In order to limit the total probability of error to $\gamma$, we can give as input to the algorithm the parameter
$\frac{\gamma}{2}$, and bound the additional error introduced by the simulated coin flips by the remaining $\frac{\gamma}{2}$. For example, an
algorithm simulating a single $m$-sided coin flip can flip a two-sided coin $\lceil \log_2 m \rceil$ times, interpret the results as
the binary representation of an integer between 1 and $2^{\lceil \log_2 m \rceil}$ and, if the result is between 1 and $m$, use this
value to make the choice. If $m$ is not a power of 2 then there will be a nonzero probability that none of the $m$
possibilities is chosen; in this case the process can be repeated up to $1 - \log_2 \gamma$ times until one of the $m$ values
is selected. The probability that no choice would have been made after $1 - \log_2 \gamma$ iterations is no more than $\frac{\gamma}{2}$.
Thus the overall error bound of $\gamma$ is maintained with only a small polynomial increase in running time.
definition is from Valiant [73].) For a more detailed discussion of this and other models of the
learning problem, see [37]. 

Definition 2.0.1 The class of representations $\mathbf{R} = (R, \Gamma, c, \Sigma)$ is PAC-learnable if there exists
a (possibly randomized) algorithm $L$ and a polynomial $p_L$ such that for all $n, s \ge 1$, for all $\epsilon$
and $\delta$ (with $0 < \epsilon, \delta < 1$), for all $r \in R^{[s]}$, and for all probability distributions $D$ over $\Sigma^{[n]}$, if $L$
is given as input the parameters $s$, $\epsilon$, and $\delta$, and may access the oracle EXAMPLE($D$, $r$), then
$L$ halts in time $p_L(n, s, \frac{1}{\epsilon}, \frac{1}{\delta})$ and, with probability at least $1 - \delta$, outputs a representation $r' \in R$
such that $D(r' \oplus r) \le \epsilon$. Such an algorithm $L$ is a polynomial-time learning algorithm for $\mathbf{R}$.

Note that the algorithm is given an upper bound s on the size of the representation to be 
learned. However, any learning algorithm that receives such a bound can be replaced by one 
which does not receive this information, provided we allow the algorithm to halt in polynomial 
time only with high probability [37]. 

Note also that since $L$ runs in time $p_L(n, s, \frac{1}{\epsilon}, \frac{1}{\delta})$, any $r'$ output must satisfy $|r'| \le p_L(n, s, \frac{1}{\epsilon}, \frac{1}{\delta})$.
We will frequently abbreviate "PAC-learnable" by "learnable" in what follows. If

$D(r' \oplus r) \le \epsilon$,

then we say that $r'$ is an $\epsilon$-approximation of $r$ (with respect to $D$), or is $\epsilon$-accurate for $r$ (with
respect to $D$), omitting the parenthesized phrase whenever $D$ is clear from context.
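
To make the error measure concrete, the sketch below computes the probability mass of the symmetric difference of a hypothesis and a target for a distribution with finite support. The dictionary encoding of the distribution, the function names, and the toy instance are assumptions for illustration; the comparison is written as "at most epsilon", which is an assumed reading of the inequality.

```python
from fractions import Fraction


def error(D, r_prime, r):
    """Total probability, under D, of the strings on which r_prime and r disagree.

    D is assumed to be a dict mapping each string in a finite support to its
    probability; r_prime and r map strings to labels 0 or 1.
    """
    return sum((p for x, p in D.items() if r_prime(x) != r(x)), Fraction(0))


def is_epsilon_approximation(D, r_prime, r, epsilon):
    """True if the distribution-weighted error of r_prime is at most epsilon."""
    return error(D, r_prime, r) <= epsilon


# Toy instance: the target accepts strings starting with "1";
# the hypothesis rejects every string.
D = {"00": Fraction(1, 2), "01": Fraction(1, 4), "11": Fraction(1, 4)}
target = lambda x: 1 if x.startswith("1") else 0
reject_all = lambda x: 0
```

In this instance the two functions disagree only on the string "11", so the error equals the weight that the distribution places on that string.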

Thus PAC-learnability requires that a learning algorithm exists that, with high probability
($1 - \delta$), can produce an $\epsilon$-approximation of any unknown target concept from the class of
representations being learned. Further, the running time (and hence the number of examples used
by the learning algorithm) may increase at most polynomially in the inverse of the parameters
$\epsilon$ and $\delta$, and polynomially in the length $n$ of each example and the bound $s$ on the size of the
representation of the target concept.

We define some representation classes over the domain $\{0, 1\}^*$. Given some $n \in \mathbb{N}$, a literal
is either the symbol $x_i$ or its negation $\bar{x}_i$ for some $i$ such that $1 \le i \le n$. In the following, let $k$
be any fixed natural number.

monomials: $\bigcup_{n \in \mathbb{N}} \{m : m \text{ is a conjunct of literals over } n \text{ variables}\}$.

kDNF: k-disjunctive normal form formulas $= \bigcup_{n \in \mathbb{N}} \{r : r \text{ is a disjunct of monomials, each with at most } k \text{ literals, over } n \text{ variables}\}$.

kCNF: k-conjunctive normal form formulas $= \bigcup_{n \in \mathbb{N}} \{r : r \text{ is a conjunct of clauses, each containing at most } k \text{ literals, over } n \text{ variables}\}$, where a clause is a disjunct of literals.

k-term-DNF: $\bigcup_{n \in \mathbb{N}} \{r : r \text{ is a disjunct of at most } k \text{ monomials over } n \text{ variables}\}$.

k-clause-CNF: $\bigcup_{n \in \mathbb{N}} \{r : r \text{ is a conjunct of at most } k \text{ clauses over } n \text{ variables}\}$.

decision-lists: $\bigcup_{n \in \mathbb{N}} \{DL : DL \text{ is a decision-list over } n \text{ variables}\}$, where a decision-list (over
 $n$ variables, for any $n \in \mathbb{N}$) is a list of pairs $DL = ((m_1, b_1), \ldots, (m_\ell, b_\ell))$, where each
 $m_i$ is a monomial (over $n$ variables) and each $b_i$ is either 0 or 1. The value of $DL$ on
 $x \in \{0, 1\}^n$ is defined algorithmically: let $i$ be the least number such that $x$ satisfies $m_i$.
 Then $DL(x) = b_i$ (or 0 if no such $i$ exists).

k-decision-lists: $\bigcup_{n \in \mathbb{N}} \{DL : DL \text{ is a decision-list over } n \text{ variables and each monomial in } DL \text{ contains at most } k \text{ literals}\}$.

The definitions of decision-lists and k-decision-lists are due to Rivest [66].
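
The algorithmic definition of the value of a decision-list translates almost directly into code. In the sketch below, a monomial is encoded as a tuple of (index, required bit) pairs over 0-indexed variables and an input is a tuple of bits; this encoding and the function names are assumptions for illustration.

```python
def satisfies(x, monomial):
    """True if the bit tuple x satisfies every literal (i, v), meaning x[i] == v."""
    return all(x[i] == v for i, v in monomial)


def eval_decision_list(dl, x):
    """Value of the decision list dl on x: the bit paired with the first
    monomial that x satisfies, or 0 if x satisfies none of them."""
    for monomial, b in dl:
        if satisfies(x, monomial):
            return b
    return 0


# Hypothetical list over three variables:
#   if x_0 and not x_1 then 1; else if x_2 then 0; else 1.
DL = [
    (((0, 1), (1, 0)), 1),
    (((2, 1),), 0),
    ((), 1),              # the empty monomial is satisfied by every input
]
```

Under this encoding a k-decision-list is simply a list in which every monomial tuple has at most k entries.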

A common variation on this model is learning the class $\mathbf{R}$ in terms of another class $\mathbf{H}$, in
which the learning algorithm must output hypotheses from the representation class $\mathbf{H}$, rather
than $\mathbf{R}$. Another variation is polynomial predictability; a class $\mathbf{R}$ is polynomially predictable if
there exists a class $\mathbf{H}$ (for which membership can be tested in polynomial time) such that $\mathbf{R}$ is
learnable in terms of $\mathbf{H}$.

The following theorem presents some results that will be used in Chapter 4. 

Theorem 2.0.2 

1. Monomials are PAC-learnable [73].

2. For each $k \ge 1$, kDNF is PAC-learnable [74].

3. For each $k \ge 1$, kCNF is PAC-learnable [73].

4. For each $k \ge 1$, k-decision-lists are PAC-learnable [66].

5. For each $k \ge 1$, k-term-DNF is PAC-learnable in terms of kCNF [61].

6. For each $k \ge 1$, k-clause-CNF is PAC-learnable in terms of kDNF [61].
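
Item 1 can be illustrated with the classical elimination procedure for monomials. The thesis cites [73] for the result without restating an algorithm, so the sketch below is an assumption about the standard approach rather than a transcription of the cited proof. Literals are encoded as (index, required bit) pairs over 0-indexed variables, and all names are hypothetical. The sketch only produces a monomial consistent with a sample of a target monomial; the number of examples needed for the PAC guarantee is not addressed here.

```python
def learn_monomial(n, sample):
    """Return the set of literals (i, v) kept by the elimination procedure.

    Start from all 2n literals and delete every literal that some positive
    example contradicts. Negative examples are ignored.
    """
    literals = {(i, v) for i in range(n) for v in (0, 1)}
    for x, label in sample:
        if label == 1:
            literals -= {(i, 1 - x[i]) for i in range(n)}
    return literals


def eval_monomial(literals, x):
    """1 if the bit tuple x satisfies every literal, else 0."""
    return 1 if all(x[i] == v for i, v in literals) else 0


# Sample labeled by the target monomial "x_0 and not x_2" over three variables.
SAMPLE = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 1, 1), 0)]
HYPOTHESIS = learn_monomial(3, SAMPLE)
```

Because no literal of the target is ever deleted, the hypothesis accepts only strings that the target accepts, and by construction it accepts every positive example in the sample.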
In Euclidean domains, classes such as unions of rectangles and unions of half-spaces have
been proven to be learnable [10, 12]. All nonlearnability results under the model described
above depend on assumed hardness results from complexity theory or cryptography. Under the
assumption that RP $\ne$ NP, the classes of k-term-DNF and k-clause-CNF are not learnable [61].
Both of these classes are, however, polynomially predictable, since each can be learned in terms
of another representation class [61]. Each of the other classes mentioned in Theorem 2.0.2 is
also polynomially predictable.$^3$

A number of other interesting results on PAC-learning have been proved in the literature. 
See, for example, [12, 35, 46, 47, 54], and many of the papers in [39, 67]. 

In [11] and [12] it was shown that a sufficient condition for a representation class to be
PAC-learnable is that there exist an Occam algorithm for the class. An Occam algorithm for a class
$\mathbf{R} = (R, \Gamma, c, \Sigma)$ is an algorithm that, when given a finite sample of any concept in $R$, outputs
in polynomial time a description of a "simple" concept in the class that is consistent with the
given sample. (A concept $r'$ is consistent with a sample of the concept $r$ if the examples in the
sample that are in $r'$ are exactly those that are in $r$.) Depending on the domain, the definition
of simple measures either the number of bits in the concept description [11] or the complexity 
of the class of possible hypotheses output by the algorithm, as measured by a combinatorial 
parameter called the Vapnik-Chervonenkis dimension [12]. An Occam algorithm is thus able 
to compress the information contained in the sample. If such a compression algorithm exists, 
the representation class is PAC-learnable. We define Occam algorithms formally in the next 
chapter. 

PAC-learning, as well as models that follow the general description given above, is a model
of supervised learning. In supervised learning there is a teacher (the oracle EXAMPLE($D$, $r$),
in the case of PAC-learning) that gives the learner examples, with each example labeled as to 
whether it is in the target concept. Depending on the particular model of learning, the teacher 
may give the learner additional information about the target concept as well. Unsupervised 
learning models the situation in which there are no a priori underlying concepts to be learned, 
but rather the objective of the learner is to partition the elements of the domain in a manner 
consistent with some predetermined criterion. This approach is also known as clustering, and 
has been studied extensively. 

$^3$ See [37, 54, 63] for comparisons of this and other models of prediction.
In Arthur-Merlin games [5], interactive proof systems [32], and games against nature [60],
a "prover" and a "verifier" interact in a manner somewhat similar to that of the teacher and
learner in supervised learning. Under these protocols, as in supervised learning, one of the
parties (the prover) supplies information to the other party (the verifier) in an attempt to
elicit a desired response. As compared to models of learning, the provers in these protocols are
allowed considerably greater computational power than are teachers, and have fewer restrictions 
on the type of information they may communicate to the verifier. 
3 OCCAM ALGORITHMS AND PAC-LEARNABILITY



In this chapter we prove that for many natural concept classes the existence of an Occam 
algorithm is also a necessary condition for PAC-learnability. In particular, we show that PAC- 
learnability is equivalent to the existence of Occam algorithms for concept classes that are closed 
under exception lists (defined in Section 3.2). Consequently, for such classes PAC-learning is
equivalent to compression, either in terms of the number of bits in a concept description or in 
terms of the Vapnik-Chervonenkis dimension. 

3.1 Occam Algorithms 

Occam's razor, which asserts that "entities should not be multiplied unnecessarily" [58], has
been interpreted to mean that, when offered a choice among hypotheses that describe a set of
data, the shortest hypothesis is to be preferred. Unfortunately, when applied to the problem
of finding a concept that fits a sample, finding the shortest hypothesis is often computationally
intractable [11, 36, 61]. It has been shown that settling for a short hypothesis, as opposed to
the shortest one possible, is nonetheless an effective technique in the context of PAC-learning.
Following [11], define an Occam algorithm to be a polynomial-time algorithm that, when given
as input a sample $M$ of the concept induced by an unknown representation $r \in R$ and a bound
$s$ on $|r|$, outputs a short (but not necessarily the shortest) representation $r'$ in $R$ such that $r$
and $r'$ are identical when only the strings in $M$ are considered. We make this more precise.

Let $S_{m,n,r} = \{M : M$ is a sample of size $m$ of $r \in R$, and all examples in $M$ have length
at most $n\}$. (Recall that if $M$ is any sample of $r$, then $r'$ is consistent with $M$ if for every
$(x, r(x)) \in M$, $r'(x) = r(x)$.) Define strings($M$) to be the set $\{x : (x, r(x)) \in M\}$.

Definition 3.1.1 A randomized polynomial-time (length-based) Occam algorithm for a class
of representations $\mathbf{R} = (R, \Gamma, c, \Sigma)$ is a (possibly randomized) algorithm $O$ such that there exists
a constant $\alpha < 1$ and a polynomial $p_O$, and such that for all $m, n, s \ge 1$ and $r \in R^{[s]}$, if $O$
is given as input any sample $M \in S_{m,n,r}$, any $\gamma > 0$, and $s$, then $O$ halts in time polynomial
in $m$, $n$, $s$, and $\frac{1}{\gamma}$ and, with probability at least $1 - \gamma$, outputs a representation $r' \in R$ that is
consistent with $M$ and such that $|r'| \le p_O(n, s, \frac{1}{\gamma}) m^{\alpha}$.
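
The role of the exponent $\alpha < 1$ can be seen by contrast with a hypothesis that is consistent but does not compress. The sketch below implements strings(M), a consistency check, and a trivial lookup-table hypothesis that memorizes the positive examples. The comma-separated encoding of the table and all function names are assumptions for illustration.

```python
def strings(M):
    """strings(M): the set of strings appearing in the sample M."""
    return {x for x, _ in M}


def is_consistent(r_prime, M):
    """True if the hypothesis r_prime agrees with every labeled example in M."""
    return all(r_prime(x) == label for x, label in M)


def lookup_table(M):
    """Trivial consistent hypothesis: memorize the positive examples of M."""
    positives = sorted(x for x, label in M if label == 1)
    description = ",".join(positives)        # hypothetical encoding of the table
    return description, (lambda x: 1 if x in positives else 0)


# Sample of the concept "strings ending in 1"; every example here is positive.
M = [("1", 1), ("01", 1), ("11", 1), ("001", 1)]
description, hypothesis = lookup_table(M)
```

The table is consistent with the sample, but its description contains every positive string and so grows linearly with the number of distinct positive examples. In general it therefore cannot meet a size bound that is sublinear in the sample size $m$, which is what separates an Occam algorithm from mere memorization.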
The above definition is a slight generalization of that in [11]. As in the definition of
PAC-learnability, we may omit the upper bound $s$ on $|r|$ that is supplied to the algorithm if we
are willing to allow the algorithm to halt in polynomial time only with high probability. Note
that if the sample $M$ is a set (but not a multiset) for which an Occam algorithm $O$ finds a
consistent $r'$ meeting the required length bounds, then $O$ can be modified to ignore duplicate
examples and thus output the same $r'$ on input of any extension of $M$ to a multiset $M'$. Thus
to show that an Occam algorithm performs as desired on a given multiset $M$ it is sufficient
to show that it performs as desired on the set of distinct elements of M. Consequently, we 
assume without loss of generality that any sample M input to an Occam algorithm contains 
only distinct elements. 

The following theorem is a straightforward generalization of Theorem 2.3 of [11]. 

Theorem 3.1.2 Let $\mathbf{R} = (R, \Gamma, c, \Sigma)$ be a class of representations, with $\Gamma$ finite. If there exists
a randomized polynomial-time (length-based) Occam algorithm for $\mathbf{R}$, then $\mathbf{R}$ is PAC-learnable.

Theorem 3.1.2 generalizes the result in [11] by allowing the running times of learning
algorithms and Occam algorithms to be polynomial in the example length $n$, and by allowing for
randomized Occam algorithms. Similarly, the lengths of the hypotheses output by an Occam
algorithm are now allowed to depend polynomially on $n$ and $\frac{1}{\gamma}$.

Proof: The proof is similar to the one in [11], with minor modifications as follows. We
are parameterizing the representation class by both hypothesis size and example length,
instead of just by hypothesis size. Thus each occurrence of the hypothesis size $|r|$ (denoted by
$n$ in [11]) should be replaced by the product of the bound $s$ on the hypothesis size and the
example length ($sn$ in our notation). Both the Occam algorithm and the learning algorithm are
given $s$ as a parameter. Since an Occam algorithm can now be randomized, we allocate half of
the permissible probability of error to the Occam algorithm itself (by giving it the parameter
$\gamma = \frac{\delta}{2}$) and use the remaining $\frac{\delta}{2}$ to bound the probability that the output hypothesis has error
larger than $\epsilon$. The latter is achieved by replacing each occurrence of $\delta$ in the proof in [11] by
$\frac{\delta}{2}$. Thus the total probability of producing a hypothesis with error $\epsilon$ or more is bounded by $\delta$. □

3.2 Exception Lists 



In the next section we prove the converse to Theorem 3.1.2 for all classes of representations 
that satisfy a certain closure property. The property dictates that a finite list of exceptions may 
be incorporated into any representation from the class without a large increase in size. More 
specifically, the class of representations must be closed under taking the symmetric difference 
of a representation's underlying concept with a finite set of elements from the domain. Further, 
there must exist an efficient algorithm that, when given as input such a representation and 
finite set, outputs the representation of their symmetric difference. 

Definition 3.2.1 A class $\mathbf{R} = (R, \Gamma, c, \Sigma)$ is polynomially closed under exception lists if there 
exists an algorithm EXLIST and a polynomial $p_{EX}$ such that for all $n \ge 1$, on input of any $r \in R$ 
and any finite set $E \subseteq \Sigma^{[n]}$, EXLIST halts in time $p_{EX}(n, |r|, |E|)$ and outputs a representation 
EXLIST$(r, E) = r_E \in R$ such that $c(r_E) = c(r) \oplus E$. Note that the polynomial running time of 
EXLIST implies that $|r_E| \le p_{EX}(n, |r|, |E|)$. If in addition there exist polynomials $p_1$ and $p_2$ 
such that the tighter bound $|r_E| \le p_1(n, |r|, \log|E|) + p_2(n, \log|r|, \log|E|)\,|E|$ is satisfied, then 
we say that $\mathbf{R}$ is strongly polynomially closed under exception lists. 

Clearly any representation class that is strongly polynomially closed is also polynomially 
closed. The definition of polynomial closure above is easily understood — it asserts that the 
representation $r_E$ that incorporates exceptions E into the representation r has size at most 
polynomially larger than the size of r and the total size of E, the latter of which is at most 
$n|E|$. The property of strong polynomial closure under exception lists seems less intuitive; we 
will motivate the definition after we prove that it is satisfied by the class of Boolean-valued 
circuits. 

Example: Circuits are strongly polynomially closed. Consider the class of Boolean- 
valued circuits with n Boolean variables $x_1, \ldots, x_n$ as inputs, and consisting of binary gates $\wedge$ 
and $\vee$, and unary gate $\neg$, denoting logical AND, OR, and NOT, respectively. Given such a 
circuit C, and a list E of assignments to the input variables, we describe a circuit $C_E$ that on 
input of any assignment a, produces the same output as C if and only if $a \notin E$. $C_E$ computes 
the exclusive-OR of two subcircuits C and C'. The subcircuit C' has $O(n|E|)$ gates, and outputs 
1 if and only if the assignment to the input variables $x_1, \ldots, x_n$ is in the set E. Clearly, $C_E$ has 
the desired behavior. Let k be the number of gates in C. Then the number of gates in $C_E$ is 
$O(k + n|E|)$. 

We assume that each circuit is represented as a list of tuples of the following form. An OR 
gate $g_z = g_x \vee g_y$ is denoted by the quadruple $(x, \vee, y, z)$, where x, y, and z are binary strings 
denoting numbers used as names for gates, and the symbol "$\vee$" is in the representation alphabet. 
AND and NOT gates are handled similarly, as is the specification of the input and output. It 
follows that a string of $O(k \log k)$ characters is sufficient to represent a circuit containing k 
gates. Thus if r and $r_E$ are the representations for C and $C_E$ above, we have 

$$|r_E| = O((k + n|E|)\log(k + n|E|))$$
$$= O(k\log(k + n|E|) + n|E|\log(k + n|E|))$$
$$= p_1(n, |r|, \log|E|) + p_2(n, \log|r|, \log|E|)\,|E|$$

for some polynomials $p_1$ and $p_2$. Thus the class of Boolean circuits is strongly polynomially 
closed under exception lists. 

The above example is helpful in motivating the definition of strong polynomial closure 
under exception lists. Typically, a representation class is a collection of strings, each of which 
encodes some underlying mathematical structure (e.g., a circuit). Note that the intuitive size 
of the structure is not the same as the number of bits needed to represent it. In the case of 
a Boolean circuit, a natural measure of size is the number of gates and wires needed to build 
the circuit. Assuming bounded fan-in (as we have done), this is O(k) where k is the number 
of gates (including the input nodes). However, in order to encode the circuit description, we 
require 0(k log k) bits to name the gates and specify the connection pattern. 

In our construction of $C_E$ from C above, all that was necessary was the addition of a new 
component C' that checked membership in the set E. Then C and C' were easily connected 
together to form $C_E$. Thus the size of $C_E$ is roughly the sum of the size of C and the size of the 
exception list, the latter of which is $n|E|$. Strong polynomial closure under exception lists is 
meant to model exactly this situation, wherein a set E of exceptions can be incorporated into 
some structure C by simply adding an additional substructure of size roughly the size of the list 
E. The two polynomials $p_1$ and $p_2$ in the definition of strong polynomial closure under exception 
lists are meant to correspond roughly to the sizes of these two components in the structure which 
incorporates the exceptions. As noted above, there is a logarithmic discrepancy between the 
intuitive size of the mathematical structure and the number of bits needed to represent it. 
Consequently, the polynomials have arguments which allow for logarithmic cross-terms such as 
$|r|\log|E|$ and $|E|\log|r|$. 

Other Examples We give examples of a number of natural classes of representations that 
are strongly polynomially closed under exception lists. 

The property for Boolean formulas can be demonstrated as follows. Let F be a Boolean 
formula and $E = \{e_1^+, e_2^+, \ldots, e_p^+, e_1^-, e_2^-, \ldots, e_q^-\}$ be a set of exceptions, where the strings with 
"+" superscripts satisfy F and the strings with "-" superscripts do not. Let $f_1^+, f_2^+, \ldots, f_p^+$, 
$f_1^-, f_2^-, \ldots, f_q^-$ be the monomials satisfied only by $e_1^+, e_2^+, \ldots, e_p^+, e_1^-, e_2^-, \ldots, e_q^-$, respectively. 
(Recall that a monomial is a conjunct of literals.) Strong polynomial closure under exception 
lists is witnessed by the formula defined by 

$$F_E = (F \vee f_1^- \vee f_2^- \vee \cdots \vee f_q^-) \wedge (\neg f_1^+) \wedge (\neg f_2^+) \wedge \cdots \wedge (\neg f_p^+).$$
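
The construction can be checked mechanically on a small example. The following is a minimal Python sketch, not part of the original text: formulas are modelled as Python predicates on 0/1 tuples, set membership stands in for the single-assignment monomials, and the names `with_exceptions` and `majority` are illustrative assumptions.

```python
from itertools import product

def with_exceptions(F, exceptions):
    """Return a predicate for (F or f1- or ... ) and not f1+ and not ...

    F is a predicate on 0/1 tuples (a stand-in for a Boolean formula).
    Testing `x in neg` plays the role of the disjunction of the monomials
    f^-, each of which is satisfied by exactly one assignment.
    """
    pos = {e for e in exceptions if F(e)}       # exceptions that satisfy F
    neg = {e for e in exceptions if not F(e)}   # exceptions that do not
    def F_E(x):
        return (F(x) or x in neg) and x not in pos
    return F_E

def majority(x):
    # Hypothetical example formula on three variables.
    return sum(x) >= 2

exceptions = {(1, 1, 0), (0, 0, 1)}
F_E = with_exceptions(majority, exceptions)

# F_E should agree with F exactly on the assignments outside the exception set.
agrees = all(F_E(x) == (majority(x) != (x in exceptions))
             for x in product((0, 1), repeat=3))
```

In this toy setting the exhaustive comparison over all eight assignments confirms that the formula computes the symmetric difference of the concept of F with the exception set.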

Recall that a decision-list over n Boolean variables is a sequence of pairs 

$$((m_1, b_1), (m_2, b_2), \ldots, (m_t, b_t))$$

where each $m_i$ is a monomial and each $b_i$ is either 0 or 1. The value of a decision-list on a 
setting of the n Boolean variables is defined to be $b_i$, where i is the least number such that $m_i$ 
is satisfied by the assignment. (If no $m_i$ is satisfied, then the value is 0.) A set of exceptions E 
can be incorporated into a decision-list by adding to the beginning of the list a pair $(m_e, b_e)$ for 
each exception $e \in E$, where $m_e$ is satisfied only by assignment e, and $b_e$ is 0 if e is accepted by 
the original decision-list, and 1 otherwise. This construction satisfies the requirements of strong 
polynomial closure under exception lists. Rivest [66] gives an algorithm for learning k-decision- 
lists, for each constant k. (Recall that the class of k-decision-lists consists of all decision-lists 
DL for which each monomial $m_i$ in the list DL has at most k literals.) It is not known whether 
the class of k-decision-lists is strongly polynomially closed under exception lists. 
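
The prepending construction for decision-lists can be illustrated by a short Python sketch. The encoding is an assumption made here for illustration only: a monomial is a dictionary from variable index to required value, and the function names are hypothetical.

```python
from itertools import product

def evaluate(decision_list, assignment):
    """Value of the first pair whose monomial is satisfied, or 0 if none is."""
    for monomial, bit in decision_list:
        if all(assignment[i] == v for i, v in monomial.items()):
            return bit
    return 0

def add_exceptions(decision_list, exceptions):
    """Prepend one pair (m_e, b_e) per exception e, as described in the text."""
    prefix = []
    for e in exceptions:
        m_e = dict(enumerate(e))                 # satisfied only by assignment e
        b_e = 1 - evaluate(decision_list, e)     # flip the original value
        prefix.append((m_e, b_e))
    return prefix + list(decision_list)

# Hypothetical decision-list over three variables x0, x1, x2.
dl = [({0: 1, 1: 1}, 1), ({2: 0}, 1)]
exceptions = [(1, 1, 0), (0, 0, 1)]
dl_E = add_exceptions(dl, exceptions)

flipped_only_on_E = all(
    evaluate(dl_E, x) == (evaluate(dl, x) ^ (x in exceptions))
    for x in product((0, 1), repeat=3))
```

Each prepended monomial mentions all n variables, which is why the same construction does not obviously stay inside the class of k-decision-lists for constant k.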

The reader may verify that the classes of decision trees and arbitrary programs are strongly 
polynomially closed under exception lists, as is any class of resource-bounded Turing machines 
that allows at least linear time. 

Let H be the class of (hyper)rectangles with faces parallel to the coordinate axes in n- 
dimensional Euclidean space. Let B(H) be the Boolean closure of H; that is, the class of 
regions defined by unions, intersections, and complements of a finite number of elements of H. 
It is easily shown that B(H) is strongly polynomially closed under exception lists, using either 
the unit cost or logarithmic cost model and any reasonable encoding scheme. 

For any fixed alphabet $\Sigma$, the class of DFAs is strongly polynomially closed under exception 
lists. However, if we consider DFAs over arbitrary finite alphabets as a single representation 
class, then strong polynomial closure does not appear to hold. In Section 3.5 an ad hoc argument 
is given that shows that the class of DFAs over arbitrary finite alphabets is PAC-learnable if 
and only if it admits a length-based Occam algorithm. The argument in Section 3.5 also shows 
that strong polynomial closure holds for any fixed $\Sigma$. 

There are some classes of representations, such as unions of axis-aligned rectangles in Eu- 
clidean space, that do not meet the above definitions of closure under exception lists but do 
have a weaker closure property that is also sufficient to prove the results of Sections 3.3 and 
3.4. This weaker property is discussed in Section 3.6. 

3.3 Results for Finite Representation Alphabets 

We consider the case in which the alphabet $\Gamma$ (over which the representations of concepts are 
described) is finite. This typically occurs when concepts are defined over discrete domains (e.g., 
Boolean formulas, automata, etc). Representations that rely on infinite alphabets (e.g., those 
involving real numbers) are considered in the next section. 

We show that strong polynomial closure under exception lists guarantees that learnability 
is equivalent to the existence of Occam algorithms. Theorem 3.1.2 states that if for the class of 
representations $\mathbf{R} = (R, \Gamma, c, \Sigma)$ there is a randomized polynomial-time algorithm that, for any 
finite sample M of $r \in R$, outputs a rule describing which elements of strings(M) are in c(r) 
that is significantly shorter than the sample itself, then $\mathbf{R}$ is PAC-learnable. Thus if there exists 
an efficient algorithm that can compress the information about the concept c(r) contained in M, 
then the class of representations can be learned. The results of this section show that, for many 
interesting classes of representations $\mathbf{R}$, if $\mathbf{R}$ is learnable then such a compression algorithm 
must exist. Thus not only is compressibility a sufficient condition for PAC-learnability, it is a 
necessary condition as well. Hence learnability is equivalent to data compression, in the sense 
of the existence of an Occam algorithm, for a large number of natural domains. This answers 
an open question in [11] for many classes of representations. 

Theorem 3.3.1 If $\mathbf{R} = (R, \Gamma, c, \Sigma)$ is strongly polynomially closed under exception lists and $\mathbf{R}$ 
is PAC-learnable, then there exists a randomized polynomial-time (length-based) Occam algo- 
rithm for $\mathbf{R}$. 

Corollary 3.3.2 Let $\Gamma$ be a finite alphabet. If $\mathbf{R} = (R, \Gamma, c, \Sigma)$ is strongly polynomially closed 
under exception lists, then $\mathbf{R}$ is PAC-learnable if and only if there exists a randomized polynomial- 
time (length-based) Occam algorithm for $\mathbf{R}$. 

Proof of Theorem 3.3.1 and Corollary 3.3.2 

Corollary 3.3.2 follows immediately from Theorem 3.1.2 and Theorem 3.3.1. To prove Theo- 
rem 3.3.1, let L be a learning algorithm for $\mathbf{R} = (R, \Gamma, c, \Sigma)$ with running time bounded by the 
polynomial $p_L$. Let EXLIST witness that $\mathbf{R}$ is strongly polynomially closed under exception 
lists, with polynomials $p_1$ and $p_2$ as mentioned in Definition 3.2.1. Let a be a sufficiently large 
constant so that for all $n, s, t \ge 1$, and for all $\varepsilon$ and $\delta$ such that $0 < \varepsilon, \delta < 1$, 



Let b be sufficiently large such that for all $n, s, t \ge 1$, and for all $\varepsilon$ and $\delta$ such that $0 < \varepsilon, \delta < 1$, 



(log«) 0+ * < x^+*. 

We show that algorithm O (Figure 3.1) is a randomized polynomial-time (length-based) 
Occam algorithm for $\mathbf{R}$, with associated polynomial 



and constant 

$$\alpha = \frac{2a+1}{2a+2}.$$

Since r' correctly classifies every $x \in$ strings(M) $- E$ and incorrectly classifies every $x \in E$, 
$r'_E$ is consistent with M. Since $\mathbf{R}$ is closed under exception lists, $r'_E \in R$. 

The time required for the first step of algorithm O is bounded by the running time of L, 
which is no more than 

$$p_L(n, s, 1/\varepsilon, 1/\gamma),$$

Algorithm O (Inputs: s; $\gamma$; $M \in S_{m,n,r}$) 

1. Run the algorithm L, giving it the input parameters s, 
$\varepsilon = m^{-1/(a+1)}$, and $\delta = \gamma$. Whenever L asks for a randomly 
generated example, choose an element $x \in$ strings(M) ac- 
cording to the probability distribution $D(x) = 1/m$ for each 
of the m (without loss of generality, distinct) elements of 
strings(M), and supply the example (x, r(x)) to L. Let r' be 
the output of L. 

2. Compute the exception list $E = \{x \in \text{strings}(M) : r'(x) \ne r(x)\}$. 
The list E is computed by running EVAL(r', x) for 
each $x \in$ strings(M). (The algorithm EVAL is defined on 
page 6.) 

3. Output $r'_E$ = EXLIST(r', E). 

Figure 3.1: Occam algorithm derived from learning algorithm L 

which is polynomial in n, s, m, and $1/\gamma$. Note that this immediately implies that |r'| is bounded 
by the same polynomial. 

For each of the m distinct elements x in strings(M), each of length at most n, the second 
step executes EVAL(r', x), so the total running time for step 2 is bounded by $(km)\,p_{eval}(|r'|, n)$, 
where k is some constant and $p_{eval}$ is the polynomial that bounds the running time of algorithm 
EVAL. Since |r'| is at most $p_L(n, s, 1/\varepsilon, 1/\gamma)$, the running time for the second step is polynomial 
in n, s, m, and $1/\gamma$. 

Since EXLIST is a polynomial-time algorithm, the time taken by the third step is a poly- 
nomial function of |r'| and the length of the representation of E. Again, |r'| is polynomial in 
n, s, m, and $1/\gamma$, and the length of the representation of E is bounded by some constant times 
nm, since $|E| \le m$ and each element $x \in E$ has size at most n. We conclude that O is a 
polynomial-time algorithm. 
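
The three steps of Algorithm O in Figure 3.1 can be sketched in Python as follows. This is an illustrative sketch only: the learner, EVAL, and EXLIST are passed in as functions, representations are modelled as frozensets of strings, the value $\varepsilon = m^{-1/(a+1)}$ is taken from step 1 of the figure, and all names (`occam_from_learner`, `toy_learner`, and so on) are assumptions rather than part of the text.

```python
import random

def occam_from_learner(learner, evaluate, exlist, sample, s, gamma, a, rng=None):
    """Sketch of Algorithm O: learn on the uniform distribution over the
    sample, collect the misclassified examples, and patch them with EXLIST."""
    rng = rng or random.Random(0)
    m = len(sample)                        # sample holds m distinct (x, label) pairs
    eps = m ** (-1.0 / (a + 1))            # accuracy parameter handed to L
    draw = lambda: rng.choice(sample)      # D(x) = 1/m for each sample element
    r_prime = learner(draw, s, eps, gamma)                    # step 1
    exceptions = [x for x, label in sample
                  if evaluate(r_prime, x) != label]           # step 2
    return exlist(r_prime, exceptions), exceptions            # step 3

# Toy stand-ins, purely hypothetical: a weak learner that returns the
# positive examples it happened to see.
def toy_learner(draw, s, eps, delta):
    seen = [draw() for _ in range(3)]
    return frozenset(x for x, label in seen if label)

def toy_evaluate(r, x):
    return x in r

def toy_exlist(r, exceptions):
    return frozenset(r) ^ frozenset(exceptions)

target = frozenset({"00", "11", "101"})
sample = [(x, x in target) for x in ["00", "01", "10", "11", "101", "111"]]
hypothesis, exceptions = occam_from_learner(
    toy_learner, toy_evaluate, toy_exlist, sample, s=10, gamma=0.1, a=2)
consistent = all(toy_evaluate(hypothesis, x) == label for x, label in sample)
```

Whatever hypothesis the learner returns, the output is consistent with the sample by construction; the quality of the learner only affects how many exceptions must be patched in, which is what the size analysis in the proof controls.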

To complete the proof, it remains to be shown that with probability at least $1 - \gamma$, $|r'_E| \le 
p_O(n, s, 1/\gamma)\,m^{\alpha}$. Since $\mathbf{R}$ is strongly polynomially closed under exception lists, 

$$|r'_E| \le p_1(n, |r'|, \log|E|) + p_2(n, \log|r'|, \log|E|)\,|E|$$
$$\le p_1(n, p_L(n, s, 1/\varepsilon, 1/\gamma), \log|E|) + p_2(n, \log(p_L(n, s, 1/\varepsilon, 1/\gamma)), \log|E|)\,|E|$$

< |(l^]£l)- + |(=fJ2l)' w . (3.1) 

Since L is a polynomial-time learning algorithm for $\mathbf{R}$, with probability at least $1 - \delta$, $D(r \oplus r') \le 
\varepsilon$. The probability distribution D is uniform over the examples in M; thus, with probability 
at least $1 - \delta$, there are no more than $\varepsilon m$ elements $x \in$ strings(M) such that $x \in r \oplus r'$. Since 
$\delta = \gamma$, with probability at least $1 - \gamma$, 

$$|E| \le \varepsilon m = m^{1 - \frac{1}{a+1}} = m^{\frac{a}{a+1}}. \qquad (3.2)$$

Substituting the bound on |E| of inequality (3.2) into inequality (3.1), and substituting 
$m^{-1/(a+1)}$ for $\varepsilon$, we find that with probability at least $1 - \gamma$, 
= j(T)i^f*^(T)'^ 



< <**(™)° (logm)° +6 m^r. 
Case 1: m < c a , 6 , then (logm) 0+ * < (logc 0 , J( ) 0+fr < (c a , 6 ) a +\ so 

1*1 < (c^) a+6 a6 (^) a+k ' 



m«+» 



< Po(n,s,i)m 0 '. 
7 

Case 2: to > c^, then by choice of c^, (log m) a+ * < to*^P*. Thus (log m) 0+fr m^ < 

771**+ 3 , SO 



< Po(n,*,-)m 0 , 
7 



completing the proof of Theorem 3.3.1. 



3.4 Results for Infinite Representation Alphabets 

In this section we extend the results of Section 3.3 to the case in which an infinite alphabet 
is used to describe representations of concepts. Such representations typically occur when the 
domain X over which concepts are defined is itself infinite (for example, axis-aligned rectangles, 
or other geometric concepts in Euclidean space [12]). We note that Theorem 3.3.1 holds also 
for $\Gamma$ infinite, but is of dubious interest because the converse (Theorem 3.1.2, which shows that 
the existence of length-based Occam algorithms implies PAC-learnability) holds only when $\Gamma$ is 
finite. In the case of infinite $\Gamma$, a different notion of "compression" is needed; one based not on 
the length of the representation of a class of concepts, but rather on a measure of the richness, 
or complexity, of a concept class, called the Vapnik-Chervonenkis dimension (VC dimension). 
The importance of the VC dimension and its relationship with PAC-learning was established 
in [12]. 

The VC dimension and Relevant Lemmas 

Recall that a concept class C is a pair C = (C, X), where $C \subseteq 2^X$. 

Definition 3.4.1 Let C = (C, X) be a concept class, and let $S \subseteq X$. Define $\Pi_C(S) = \{c \cap S : 
c \in C\}$; thus $\Pi_C(S)$ is the set of all subsets of S obtained by taking the intersection of S 
and a concept in C. The set S is shattered by C if $\Pi_C(S) = 2^S$. The Vapnik-Chervonenkis 
dimension (VC dimension) of C is the size of the largest finite set $S \subseteq X$ that is shattered by 
C. If arbitrarily large finite subsets of X are shattered by C, then the VC dimension of C is 
infinite. 

The following lemma restates parts of Proposition A2.5 from [12]. 

Lemma 3.4.2 If (C, X) has VC dimension d, then for any finite set $S \subseteq X$, $|\Pi_C(S)| \le |S|^d + 1$. 

Another lemma that we will find useful is one that bounds the VC dimension of a concept 
class induced by taking symmetric differences with sets of bounded size. 

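Lemma 3.4.3 Let (C, X) have VC dimension d. Let $C^{\oplus l} = \{c \oplus E : c \in C, E \subseteq X, |E| \le l\}$. 
If $d_1 \ge 2$ is the VC dimension of $(C^{\oplus l}, X)$, then $d_1 / \log d_1 \le d + l + 2$. 

Proof: Let the VC dimension of $(C^{\oplus l}, X)$ be $d_1 \ge 2$, and let P be a set of cardinality $d_1$ 
that is shattered by $C^{\oplus l}$. By definition, 

$$|\Pi_{C^{\oplus l}}(P)| = |\{(c \oplus E) \cap P : c \in C, E \subseteq X, |E| \le l\}| = 2^{d_1},$$

which implies 

$$|\{(c \oplus (E \cap P)) \cap P : c \in C, E \subseteq X, |E| \le l\}| = 2^{d_1},$$

and thus 

$$|\{(c \oplus E) \cap P : c \in C, E \subseteq P, |E| \le l\}| = 2^{d_1}.$$

Since $(c \oplus E) \cap P = (c \cap P) \oplus E$ whenever $E \subseteq P$, 

$$|\{(c \cap P) \oplus E : c \in C, E \subseteq P, |E| \le l\}| = 2^{d_1}. \qquad (3.3)$$

But the left side of equation (3.3) is at most the product of $|\{c \cap P : c \in C\}|$ and $|\{E : E \subseteq 
P, |E| \le l\}|$, which is $|\Pi_C(P)| \sum_{i=0}^{l} \binom{d_1}{i}$. Thus 

$$2^{d_1} \le |\Pi_C(P)| \sum_{i=0}^{l} \binom{d_1}{i}. \qquad (3.4)$$

Substituting the upper bound on $|\Pi_C(P)|$ from Lemma 3.4.2 into inequality (3.4), we obtain 

$$2^{d_1} \le ((d_1)^d + 1) \sum_{i=0}^{l} \binom{d_1}{i}$$
$$\le 2 (d_1)^d ((d_1)^l + 1),$$

and since $d_1 \ge 2$ the above implies that 

$$\frac{d_1}{\log d_1} \le d + l + 2.$$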
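
Both Definition 3.4.1 and the bound of Lemma 3.4.3 can be checked by brute force on a small finite domain. The sketch below is an illustration under stated assumptions, not part of the original argument: concepts are frozensets, the concept class is the set of discrete intervals on {1, ..., 6}, logarithms are taken base 2, and the helper names are hypothetical.

```python
from itertools import combinations
from math import log2

def vc_dimension(concepts, domain):
    """Largest size of a subset of the domain shattered by the concepts.
    Stops at the first size with no shattered set, which is valid because
    every subset of a shattered set is itself shattered."""
    domain = list(domain)
    best = 0
    for size in range(1, len(domain) + 1):
        shattered = any(
            len({c & frozenset(S) for c in concepts}) == 2 ** size
            for S in combinations(domain, size))
        if not shattered:
            break
        best = size
    return best

def xor_closure(concepts, domain, l):
    """All sets c xor E with c a concept and E a subset of the domain, |E| <= l."""
    small_sets = [frozenset(E) for k in range(l + 1)
                  for E in combinations(domain, k)]
    return {c ^ E for c in concepts for E in small_sets}

domain = range(1, 7)
# Discrete intervals {i, ..., j-1} within the domain, including the empty set.
intervals = {frozenset(range(i, j)) for i in range(1, 8) for j in range(i, 8)}

l = 1
d = vc_dimension(intervals, domain)
d1 = vc_dimension(xor_closure(intervals, domain, l), domain)
bound_holds = d1 / log2(d1) <= d + l + 2
```

On this example the intervals have VC dimension 2, allowing one exception raises it to 4, and the lemma's bound holds with room to spare.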

Recall that if $\mathbf{R} = (R, \Gamma, c, \Sigma)$ is a class of representations, then there is a naturally associated 
concept class $C(\mathbf{R}) = (c(R), \Sigma^*)$, where $c(R) = \{c(r) : r \in R\}$. The VC dimension of a class of 
representations $\mathbf{R}$ is defined to be the VC dimension of the induced concept class $C(\mathbf{R})$. We 
write VC-dim(C) and VC-dim($\mathbf{R}$) to denote the VC dimension of the concept class C and the 
class of representations $\mathbf{R}$, respectively. 

Recall also that $\Sigma^{[n]}$ consists of strings of $\Sigma^*$ of length at most n, and that $R^{[s]}$ is the set 
of representations $r \in R$ of length at most s. If $\mathbf{R} = (R, \Gamma, c, \Sigma)$, then we define a concept class 
$\mathbf{R}_{n,s}$ consisting of elements of $R^{[s]}$ considered only with respect to examples from $\Sigma^{[n]}$. This is 
accomplished by introducing a new mapping $c_n$ that interprets any representation r only with 
respect to examples of length at most n. In particular, we define $\mathbf{R}_{n,s} = (R^{[s]}, \Gamma, c_n, \Sigma)$, where 
$c_n(r) = c(r) \cap \Sigma^{[n]}$. 

The next lemma is a minor variant of theorems appearing in [12] and [25]. 

Lemma 3.4.4 Let $\mathbf{R} = (R, \Gamma, c, \Sigma)$ be a class of representations, and let d(n, s) be the VC di- 
mension of $\mathbf{R}_{n,s}$. If $\mathbf{R}$ is PAC-learnable, then d(n, s) grows polynomially in n and s. 

The only difference between Lemma 3.4.4 and a result in [12] is that the latter does not 
allow the learning algorithm to depend on s and thus the VC dimension grows polynomially in 
n alone. The modifications to their proof needed to yield the above result are so minor as to 
be omitted. 

Dimension-based Occam algorithms and PAC-learnability 

When $\Gamma$ is infinite, the existence of a length-based Occam algorithm is not sufficient to guaran- 
tee PAC-learnability. The proof of sufficiency in the case of finite $\Gamma$ relies critically on the fact 
that for any given length n, there are at most $|\Gamma|^n$ distinct representations $r \in R$ of length n. 
Consequently the proof fails when $\Gamma$ is infinite. In order to prove a result analogous to Theo- 
rem 3.1.2 that also holds for infinite $\Gamma$, Blumer et al. [12] define a more general type of Occam 
algorithm, which we will refer to as a dimension-based Occam algorithm. As was the case with 
length-based Occam algorithms, the definition requires the algorithm to output simple hypothe- 
ses, but this time using a different definition of "simple". Rather than measuring simplicity by 
the size of the concept representation output by the Occam algorithm, this definition uses the 
notion of VC dimension to measure the expressibility of the class of concepts that the algorithm 
can output. The larger the VC dimension of the class of concepts, the greater the expressibil- 
ity, and hence the complexity, of that concept class. Thus instead of requiring the algorithm 
to output short hypotheses, the definition of a dimension-based Occam algorithm requires the 
algorithm to output hypotheses from a class with small VC dimension. The definition below is 
a slight variant of the definition in [12]. 

Definition 3.4.5 A randomized polynomial-time (dimension-based) Occam algorithm for a 
class of representations $\mathbf{R} = (R, \Gamma, c, \Sigma)$ is a (possibly randomized) algorithm O such that for 
some constant $\alpha < 1$ and polynomial $p_O$, for all $m, n, s \ge 1$ and $\gamma > 0$, there exists $R_{m,n,s,\gamma} \subseteq R$ 
such that VC-dim$((R_{m,n,s,\gamma}, \Gamma, c_n, \Sigma)) \le p_O(n, s, 1/\gamma)\,m^{\alpha}$, and if O is given as input any sample 
$M \in S_{m,n,r}$ (where $r \in R^{[s]}$) and the parameters $\gamma$ and s, then O halts in time polynomial in 
m, n, s, and $1/\gamma$ and, with probability at least $1 - \gamma$, outputs a representation $r' \in R_{m,n,s,\gamma}$ that is 
consistent with M. 

As was the case for length-based Occam algorithms, we may omit the upper bound s on |r| 
that is supplied to the algorithm if we are willing to allow the algorithm to halt in polynomial 
time only with high probability. 

The following theorem is a straightforward generalization of Theorem 3.2.1(i) of [12]. 

Theorem 3.4.6 If there exists a randomized polynomial-time (dimension-based) Occam algo- 
rithm for the class of representations $\mathbf{R}$, then $\mathbf{R}$ is PAC-learnable. 

Theorem 3.4.6 generalizes the result in [12] by allowing the running times of learning algo- 
rithms and Occam algorithms to be polynomial in n and by permitting the VC dimension to 
grow polynomially in n. The above theorem also provides for randomized Occam algorithms 
and allows the running time of the algorithm as well as the VC dimension of the class of possible 
hypotheses to grow polynomially in $1/\gamma$. 

Proof: The proof is similar to the one given in [12], with the following minor modifications. 
We are parameterizing the representation class by both n and s, instead of just by s. Because 
of this and the fact that randomized Occam algorithms are permitted, each occurrence of the 
polynomial p(s) in the proof in [12] should be replaced by $p(n, s, 1/\gamma)$. For the same reason, the 
effective hypothesis space ($C^{A}_{s,m}$ in the notation of [12]) should be replaced by $R_{m,n,s,\gamma}$, as de- 
fined above. Both the Occam algorithm and the learning algorithm are given s as a parameter. 
Finally, the parameter $\delta$ in [12] should be split between the Occam algorithm itself (which is 
run with $\gamma = \delta/2$) and the bound on the probability that the output hypothesis has error larger 
than $\varepsilon$, as described in the proof of Theorem 3.1.2. □ 

We prove the following partial converse to Theorem 3.4.6, which is analogous to Theo- 
rem 3.3.1 of the previous section. 

Theorem 3.4.7 If $\mathbf{R} = (R, \Gamma, c, \Sigma)$ is a class of representations that is polynomially closed 
under exception lists and $\mathbf{R}$ is PAC-learnable, then there exists a randomized polynomial-time 
(dimension-based) Occam algorithm for $\mathbf{R}$. 

Corollary 3.4.8 If $\mathbf{R} = (R, \Gamma, c, \Sigma)$ is a class of representations that is polynomially closed un- 
der exception lists, then $\mathbf{R}$ is PAC-learnable if and only if there exists a randomized polynomial- 
time (dimension-based) Occam algorithm for $\mathbf{R}$. 

Proof of Theorem 3.4.7 and Corollary 3.4.8 

Corollary 3.4.8 follows immediately from Theorem 3.4.6 and Theorem 3.4.7. Note that Corol- 
lary 3.4.8 holds regardless of whether $\Gamma$ is finite or infinite. Note also that for dimension-based 
Occam algorithms we only need polynomial closure under exception lists, rather than the more 
stringent condition of strong polynomial closure that appears to be required to prove Theo- 
rem 3.3.1. 

To prove Theorem 3.4.7, let L be a learning algorithm for $\mathbf{R} = (R, \Gamma, c, \Sigma)$ with polynomial 
running time $p_L$. Let $d(n, s) = $ VC-dim$(\mathbf{R}_{n,s})$ be a polynomial whose existence is guaranteed 
by Lemma 3.4.4. Let EXLIST witness that $\mathbf{R}$ is polynomially closed under exception lists. Let 
$k \ge 2$ be a constant such that for all $n, s \ge 1$, and for all $\varepsilon$ and $\delta$ such that $0 < \varepsilon, \delta < 1$, 

d(n,px(n,s,i,i)) + 2<i(^) fc . 

Let $a_k$ be a constant such that for all $x \ge a_k$, $\log x \le x^{1/(k+2)}$. 

To prove the theorem, it suffices to prove that algorithm O of the last section (with $\varepsilon$ 
of step 1 defined by $\varepsilon = m^{-1/(k+1)}$ instead of $m^{-1/(a+1)}$) is in fact a randomized polynomial-time 
(dimension-based) Occam algorithm for $\mathbf{R}$, with corresponding polynomial 

po{n,s,-) = a h k*& [™J 

and constant 

$$\alpha = \frac{k^2 + 2k}{(k+1)^2}.$$

We have already argued in Section 3.3 that O runs in time polynomial in m, n, s, and $1/\gamma$. 
(This argument still holds since $\mathbf{R}$ is polynomially closed under exception lists.) Clearly any 
$r'_E$ that is output by O is consistent with M. To complete the proof, we must exhibit a set 
$R_{m,n,s,\gamma} \subseteq R$ of VC dimension at most $p_O(n, s, 1/\gamma)\,m^{\alpha}$ such that with probability at least $1 - \gamma$ 
the output $r'_E$ of O is in the set $R_{m,n,s,\gamma}$ whenever O receives as input the parameters $\gamma$ and s 
and a sample M of cardinality m, consisting of examples of length at most n of some $r \in R$ of 
size at most s. 

Define $R_{m,n,s,\gamma}$ to be the set of representations $r'_E$ that O outputs on input of $\gamma$, s, and any 
sample M of $S_{m,n,r}$ (where r is any element of $R^{[s]}$), provided that the exception list E obtained 
in step 2 satisfies $|E| \le \varepsilon|M|$. Thus the only time that O fails to produce an element of $R_{m,n,s,\gamma}$ 
is when the learning algorithm L fails to produce a representation r' that is correct within $\varepsilon$ on 
the learning task at hand. This can happen with probability at most $\delta = \gamma$, so with probability 
at least $1 - \gamma$ the algorithm O outputs an element of $R_{m,n,s,\gamma}$. The following claim completes 
the proof of Theorem 3.4.7. 

Claim: The VC dimension of $(R_{m,n,s,\gamma}, \Gamma, c_n, \Sigma)$ is at most $p_O(n, s, 1/\gamma)\,m^{\alpha}$. 

Proof: Let $d_R$ be the VC dimension of $(R_{m,n,s,\gamma}, \Gamma, c_n, \Sigma)$. The result is immediate if 
$d_R \le 1$. Assume $d_R \ge 2$. Let the effective hypothesis space of L, denoted $\mathcal{L}_{n,s,\varepsilon,\gamma}$, be exactly 
those representations r' that L might output on input parameters $\varepsilon$, $\delta$ ($= \gamma$), s, and randomly 
generated examples, each of length at most n, of some representation in $R^{[s]}$. Since L runs in 
time bounded by polynomial $p_L$, each element of $\mathcal{L}_{n,s,\varepsilon,\gamma}$ has size at most $p_L(n, s, 1/\varepsilon, 1/\gamma)$, and thus 
$\mathcal{L}_{n,s,\varepsilon,\gamma} \subseteq R^{[p_L(n,s,1/\varepsilon,1/\gamma)]}$. Consequently, the VC dimension of the class $(\mathcal{L}_{n,s,\varepsilon,\gamma}, \Gamma, c_n, \Sigma)$ is at most 
the VC dimension of the class $(R^{[p_L(n,s,1/\varepsilon,1/\gamma)]}, \Gamma, c_n, \Sigma)$. Recall that VC-dim$(\mathbf{R}_{n,s}) \le d(n, s)$, and 
thus the VC dimension of $(\mathcal{L}_{n,s,\varepsilon,\gamma}, \Gamma, c_n, \Sigma)$ is at most $d(n, p_L(n, s, 1/\varepsilon, 1/\gamma))$. 

Note that each element $r'_E \in R_{m,n,s,\gamma}$ is obtained from the symmetric difference of some 
element r' of $\mathcal{L}_{n,s,\varepsilon,\gamma}$ and some list of exceptions $E \subseteq \Sigma^{[n]}$ of cardinality at most $\varepsilon m$. Applying 
Lemma 3.4.3 (with (C, X) equal to the concept class induced by $(\mathcal{L}_{n,s,\varepsilon,\gamma}, \Gamma, c_n, \Sigma)$, and $l = \varepsilon m$), 
we conclude that $d_R$ satisfies 

$$\frac{d_R}{\log d_R} \le d(n, p_L(n, s, 1/\varepsilon, 1/\gamma)) + \varepsilon m + 2.$$

By our choice of k, this implies that 



(3.5) 

Case 1: $d_R \le a_k$. Then clearly $a_k \le p_O(n, s, 1/\gamma)\,m^{\alpha}$, and the claim is proved. 
Case 2: $d_R > a_k$. Then by choice of $a_k$, $\log d_R \le (d_R)^{1/(k+2)}$. Thus 
and, combining this with inequality (3.5) above we have 

Wtt < }(=)'♦- 

- 1 (?)"-* 



em 



Raising each side to the power J±§, we obtain 

3.5 DFAs and Occam Algorithms 

As will be shown below, for any fixed alphabet $\Sigma$ the class of DFAs defined over alphabet $\Sigma$ is 
strongly polynomially closed under exception lists. This does not appear to be the case if the 
alphabet $\Sigma$ is allowed to vary. Nonetheless, an argument very similar to Theorem 3.3.1 may 
be employed to show that the class of DFAs (over arbitrary alphabets) is PAC-learnable if and 
only if there is a length-based Occam algorithm for the class. 

We first define a class of representations that captures the problem of learning an arbitrary 
DFA. Let $\Sigma_\infty = \{a_0, a_1, \ldots\}$ be a countably infinite alphabet. Clearly, for any finite nonempty 
alphabet $\Sigma$ the problem of PAC-learning the class of DFAs over $\Sigma$ is captured by the problem 
of PAC-learning DFAs over the finite alphabet $\{a_0, a_1, \ldots, a_{|\Sigma|-1}\}$. Similarly, we can rename 
the states of M to be $q_0, q_1, \ldots$. Thus for any DFA M to be learned, we assume without loss of 
generality that M has the following form. For some $s \ge 1$, M has states $q_0, \ldots, q_{s-1}$, and for 
some $\sigma \ge 1$, M has alphabet $\{a_0, a_1, \ldots, a_{\sigma-1}\}$. 

The representation alphabet $\Gamma$ consists of the symbols 0, 1, and several punctuation charac- 
ters. The representation r of a DFA M with s states and alphabet $\{a_0, a_1, \ldots, a_{\sigma-1}\}$ is a string 
$r = x\#w\#t$, where x is a binary string of length $\lceil \log s \rceil$ indicating that the initial state is $q_x$; 
w is a binary string of length s where the i-th bit (counting from 0) of w is 1 if and only if $q_i$ is 
an accepting state; and where t is a list of triples that represents the state transition function $\delta$ 
of M. The list t contains (i, j, k) if and only if $\delta(q_i, a_j) = q_k$, where i and k are binary numbers 
that are indices for states of M, and j is a binary number that is an index into the alphabet 
$\{a_0, \ldots, a_{\sigma-1}\}$. Assume that $s, \sigma \ge 2$.¹ Then the size of the representation r satisfies 

$$|r| = \lceil \log s \rceil + 1 + s + 1 + s\sigma(2\lceil \log s \rceil + \lceil \log \sigma \rceil + 4),$$

and thus 

$$|r| \le 12\, s\sigma \log(s\sigma). \qquad (3.6)$$
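
The size formula can be checked against a concrete encoder. The sketch below is illustrative only: the text says merely that the alphabet contains "several punctuation characters", so the choice of "#", parentheses, and commas (four punctuation characters per triple) is an assumption made here to match the "+ 4" term, and the function names are hypothetical. Logarithms are base 2.

```python
from math import log2

def ceil_log2(k):
    """Ceiling of log2(k) for integers k >= 1."""
    return (k - 1).bit_length()

def encode_dfa(s, sigma, start, accepting, delta):
    """Encode a DFA with states 0..s-1 and symbols 0..sigma-1 as x#w#t."""
    ws, wa = ceil_log2(s), ceil_log2(sigma)       # bits per state / symbol index
    x = format(start, "b").zfill(ws)              # index of the initial state
    w = "".join("1" if i in accepting else "0" for i in range(s))
    t = "".join(
        "(%s,%s,%s)" % (format(i, "b").zfill(ws),
                        format(j, "b").zfill(wa),
                        format(delta[(i, j)], "b").zfill(ws))
        for i in range(s) for j in range(sigma))
    return x + "#" + w + "#" + t

def predicted_length(s, sigma):
    """The expression for |r| given in the text."""
    ws, wa = ceil_log2(s), ceil_log2(sigma)
    return ws + 1 + s + 1 + s * sigma * (2 * ws + wa + 4)

# Hypothetical 3-state DFA over a 2-symbol alphabet.
s, sigma = 3, 2
delta = {(i, j): (i + j) % s for i in range(s) for j in range(sigma)}
r = encode_dfa(s, sigma, start=0, accepting={0, 2}, delta=delta)
```

For this example the encoding has 61 characters, matching the formula and lying well under the bound of inequality (3.6).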

Since $\Gamma$ is finite, Theorem 3.1.2 applies, and the class of DFAs is PAC-learnable if it has a 
length-based Occam algorithm. We show the converse holds, resulting in the following charac- 
terization. 

Theorem 3.5.1 The class of DFAs is PAC-learnable if and only if there exists a randomized 
polynomial-time (length-based) Occam algorithm for the class. 

Proof: It suffices to show that a PAC-learning algorithm for DFAs implies the existence of 
an Occam algorithm. The proof is nearly identical to the proof of Theorem 3.3.1, but because 
DFAs do not seem to be strongly polynomially closed under exception lists, we need a more 
careful analysis. We first define a procedure EXLIST that witnesses polynomial closure under 
exception lists for arbitrary DFAs and strong polynomial closure for DFAs over any fixed finite 
alphabet $\Sigma$. 

¹ In the case that one or both of s and $\sigma$ is 1, the upper bound on |r| of (3.6) must be adjusted slightly. We
omit this adjustment in what follows for clarity of presentation.

Let the representation r encode a DFA $M = (Q, \Sigma, \delta, q_0, F)$, where Q is the finite set of
states, $\Sigma$ is a finite alphabet, $\delta$ is the state transition function, $q_0$ is the initial state of M, and
$F \subseteq Q$ is the set of accepting states. Let |Q| = s. Let E be a finite set of strings of length at
most n. Then EXLIST(r, E) is the encoding of the DFA $M_E$ that accepts $L(M) \oplus E$ and is
constructed as follows. $M_E$ contains as a subautomaton the DFA M plus some additional states
and transitions. $M_E$ has a new start state q, and for each string $w \in E$ there is a deterministic
path of new states beginning at q, labeled with the characters of w. (The union of all such paths
forms a tree.) The last state of the path will be an accepting state if M rejects w; otherwise it
is a rejecting state. The other states in the path will be accepting or rejecting states depending
on whether the string corresponding to the state is accepted or rejected, respectively, by the
original machine M. Each new state of $M_E$ is thus uniquely associated with a prefix of some
string of E. If p is a new state of $M_E$ associated with some prefix w' of $w \in E$, and if for
some $a \in \Sigma$, w'a is not a prefix of any string in E, then we must indicate the state to which
the transition from state p on input a leads. In this case, the transition leads back to the
appropriate state of the original machine M; i.e., $\delta(p, a) = \delta(q_0, w'a)$.
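The construction of EXLIST for DFAs can be illustrated with a minimal sketch. It is not the encoding used in the proof: the tuple layout `(states, alphabet, delta, start, accepting)`, the helper names `run`, `accepts`, `exlist`, and `all_words`, and the tagging of new states as `("new", prefix)` are all assumptions made for illustration. The sketch also assumes that words are Python strings, that alphabet symbols are single characters, and that no original state is itself a tuple of the form `("new", prefix)`.

```python
from itertools import product


def run(delta, state, word):
    """Follow the transitions of `delta` from `state` along `word`."""
    for symbol in word:
        state = delta[(state, symbol)]
    return state


def accepts(dfa, word):
    states, alphabet, delta, start, accepting = dfa
    return run(delta, start, word) in accepting


def exlist(dfa, exceptions):
    """Return a DFA for L(M) xor E: a tree of new states placed in front of M."""
    states, alphabet, delta, start, accepting = dfa
    # One new state per prefix of a string in E; the empty prefix is the new start state.
    prefixes = {w[:i] for w in exceptions for i in range(len(w) + 1)} | {""}
    new_delta = dict(delta)
    new_accepting = set(accepting)
    for p in prefixes:
        # A tree state accepts exactly when membership in E flips M's verdict on p.
        if accepts(dfa, p) != (p in exceptions):
            new_accepting.add(("new", p))
        for a in alphabet:
            if p + a in prefixes:
                target = ("new", p + a)            # stay inside the tree
            else:
                target = run(delta, start, p + a)  # fall back into the original machine
            new_delta[(("new", p), a)] = target
    new_states = set(states) | {("new", p) for p in prefixes}
    return (new_states, alphabet, new_delta, ("new", ""), new_accepting)


def all_words(alphabet, max_len):
    """Enumerate every word over `alphabet` of length at most `max_len`."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)
```

For a two-state machine accepting words with an even number of a's and the exception set {a, ab, bb}, comparing `accepts(exlist(dfa, E), w)` with `accepts(dfa, w) != (w in E)` over `all_words("ab", 4)` checks the construction on every short word. The tree contributes one state per distinct prefix, which is at most n|E| + 1 new states.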

The number of new states is at most n|E| + 1, and thus the number of states in $M_E$ is at most
s + n|E| + 1. Consequently, if the representation $r_E$ encodes $M_E$, we have by inequality (3.6)

$$|r_E| \leq 12(s + n|E| + 1)\sigma \log((s + n|E| + 1)\sigma).$$

Clearly EXLIST may be implemented to run in polynomial time. Further, if $\sigma$ is treated as a
constant, then by using the fact that $|r| \geq s\sigma$, polynomials $p_1$ and $p_2$ are easily found such that

$$|r_E| \leq p_1(n, |r|, \log|E|) + p_2(n, \log|r|, \log|E|)\,|E|.$$

Thus for any fixed alphabet $\Sigma$ the class of DFAs over $\Sigma$ is strongly polynomially closed
under exception lists. However, if we do not require $\Sigma$ to be fixed, then $\sigma$ is not a constant and
$|r_E|$ is not expressible in the desired form due to the term $n|E|\sigma$.

Let L be a PAC-learning algorithm for DFAs, with polynomial run time $p_L$, and let a > 3
be a constant such that for all $n, s, \sigma \geq 1$, and for all $\epsilon$ and $\gamma$ such that $0 < \epsilon, \gamma < 1$,



We will show that algorithm O (as in Figure 3.1, with constant a defined as above) is an
Occam algorithm, with polynomial $p_O$ and constant $\alpha < 1$ to be determined later.

Let the DFA that r encodes have s states and an alphabet of $\sigma$ symbols. Then in step 1 of
algorithm O, the output r' of L satisfies

$$|r'| \leq p_L(n, 12 s\sigma \log s\sigma, 1/\epsilon, 1/\gamma).$$
The number of states in the DFA that r' encodes is at most |r'| and thus the number of
states in the DFA encoded by the output in step 3 is at most |r'| + n|E| + 1. Consequently, by
inequality (3.6),

s 12 6 (~)° + n + 0 ((! (^)° + -w + 0 ') • 

By the same reasoning as in the proof of Theorem 3.3.1, inequality (3.2) holds with
probability at least $1 - \gamma$; substituting for $\epsilon$ and |E|, it follows that

\r' E \ < 12 ^| ("J") m ^ Tcr + »m=fr<r + <rj log ^| m^<r + nm^tr + vj 

< 12a m^T log m«TT^ . 

By algebraic simplification, it is easily shown that there is a constant $c_a$ such that

\r' E \ < Ca ("~) + 

The constant $c_a$ is chosen so as to absorb other constants arising in the simplification, and such
that for all $m \geq c_a$, $\log m$ is at most the fixed positive power of m required by the simplification.
Since $|r| \geq \sigma$, for a suitable constant $\alpha < 1$ and for some polynomial $p_O$ we have
$|r'_E| \leq p_O(n, |r|, 1/\gamma)\, m^{\alpha}$, completing the proof that O is an Occam algorithm. □



3.6 Discussion

Results in [11] and [12] show that the existence of Occam algorithms is sufficient to ensure 
that a class is PAC-learnable. In a sense, this means that if there is an algorithm that, for 
any concept in the class, can compress the information about the concept contained in any 
finite sample of that concept, then the class can be learned. We have proved that not only 
are randomized Occam algorithms a sufficient condition for learnability, but they are in fact 
a necessary condition for classes that are closed under exception lists. Thus the existence of
randomized Occam algorithms exactly characterizes PAC-learnability for a wide variety of interesting
representation classes. For such classes, learning is equivalent to information compression, in
the sense just described.
Extensions

The definitions of closure under exception lists in Section 3.2 require that there exists an
algorithm EXLIST that, when given as input a representation $r \in R$ and a finite set E, outputs
a representation $r_E \in R$ such that $c(r_E) = c(r) \oplus E$. This condition is, however, stronger
than necessary to prove Theorems 3.3.1 and 3.4.7. These proofs rely only on the fact that the
class is closed under exception lists with respect to a finite sample: It is only necessary that
EXLIST output a representation $r_E$ such that $c(r_E) \cap strings(M) = (c(r) \oplus E) \cap strings(M)$;
that is, such that $c(r_E)$ and $c(r) \oplus E$ agree on all strings in a given finite sample, though not
necessarily on all strings in the domain. The definitions in Section 3.2 are presented because
they seem to be more natural properties of representation classes. However, since the weaker
definitions of closure are also sufficient to prove the existence of randomized Occam algorithms,
it is possible to show that such algorithms exist for a wider range of representation classes than
satisfy the hypotheses of Theorems 3.3.1 and 3.4.7. (In particular, when concepts are defined
over continuous domains this weaker closure property should be much easier to satisfy.)

An example of such a class is the class of unions of axis-aligned rectangles in the Euclidean
plane, which appears not to be closed under exception lists as defined in Section 3.2. This class
is, however, polynomially closed under exception lists with respect to finite samples, and is thus
learnable if and only if it admits a dimension-based Occam algorithm. This can be seen as follows.
A positive exception can be added to any concept in the class by adding to the union a rectangle 
that includes the exception, but is small enough to exclude each of the negative examples in 
the sample. To handle negative exceptions within some rectangle, for each such exception draw 
narrow horizontal and vertical bands which form a cross and include the exception in the center. 
The bands should be narrow enough so that no positive example in the sample is included in 
both the same vertical band and horizontal band as the exception. Then take the union of all 
of the vertical regions bounded by the vertical bands and all of the horizontal regions bounded 
by the horizontal bands. This new union (of a number of rectangles linear in the number of 
exceptions) covers everything except a small box around each exception. 
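Closure with respect to a finite sample can be sketched in code. The sketch below is a simplified variant and not the cross-band construction described above: it removes one small open box per negative exception from every rectangle containing it, so the number of rectangles it produces may exceed the linear bound stated in the text. The representation of a rectangle as a tuple `(x1, y1, x2, y2)` of a closed region, the function names, and the use of degenerate one-point rectangles for positive exceptions are assumptions made for illustration. Sample points are assumed to be distinct, and every exception is assumed to be a sample point.

```python
def contains(rects, pt):
    """True if `pt` lies in the union of closed rectangles (x1, y1, x2, y2)."""
    x, y = pt
    return any(x1 <= x <= x2 and y1 <= y <= y2 for x1, y1, x2, y2 in rects)


def carve(rect, e, w):
    """Rectangle minus the open box of half-width w around e, as at most 4 rectangles."""
    x1, y1, x2, y2 = rect
    ex, ey = e
    pieces = [
        (x1, y1, ex - w, y2),  # left of the box
        (ex + w, y1, x2, y2),  # right of the box
        (x1, y1, x2, ey - w),  # below the box
        (x1, ey + w, x2, y2),  # above the box
    ]
    return [(a, b, c, d) for a, b, c, d in pieces if a <= c and b <= d]


def exlist_rectangles(rects, sample_points, exceptions):
    """Union of rectangles agreeing with (concept xor exceptions) on sample_points."""
    negative = [e for e in exceptions if contains(rects, e)]
    positive = [e for e in exceptions if not contains(rects, e)]
    result = list(rects)
    for e in negative:
        others = [p for p in sample_points if p != e]
        # Chebyshev distance from e to the nearest other sample point.
        gap = min((max(abs(p[0] - e[0]), abs(p[1] - e[1])) for p in others), default=2)
        w = gap / 2  # the open box around e then contains no other sample point
        carved = []
        for r in result:
            carved.extend(carve(r, e, w) if contains([r], e) else [r])
        result = carved
    # A degenerate rectangle adds a positive exception and no other point.
    result.extend((x, y, x, y) for x, y in positive)
    return result
```

The output agrees with the symmetric difference of the original union and the exception list on the sample only; points outside the sample that fall inside a removed box are misclassified, which is exactly what the weaker closure property permits.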

Recall from Chapter 2 the notion of learning one representation class $R = (R, \Gamma, c, \Sigma)$ in
terms of another representation class $R' = (R', \Gamma', c', \Sigma)$ (originally introduced in [61]). Under
this definition, a learning algorithm for R is required to output hypotheses in R', rather than
R (of course, R' may be a superset of R). Several interesting representation classes that are
not PAC-learnable have been shown to be learnable in terms of other classes (see, for example,
[1, 36, 61]). One may generalize the notion of an Occam algorithm to that of an Occam
algorithm for a class R in terms of another class R' in a straightforward way. Analogues of
Theorems 3.1.2 and 3.4.6 prove that the existence of an Occam algorithm for R in terms of R'
implies that R is PAC-learnable in terms of R'.

The results of Theorem 3.3.1 and Corollary 3.3.2 can also be extended to the case of learning
R in terms of R'. The definition of closure under exception lists is adjusted so that EXLIST,
when given as input $r \in R$ and a finite set $E \subseteq \Sigma^*$, outputs a representation $r'_E \in R'$ such
that $c'(r'_E) = c(r) \oplus E$. It is then easily shown that if a class R is strongly polynomially closed
under exception lists in terms of a class R', then the existence of a PAC-learning algorithm for
R in terms of R' implies the existence of an Occam algorithm for R in terms of R'. It is not
known whether Theorem 3.4.7 and Corollary 3.4.8 can be generalized in this manner.

As defined in Chapter 2, a representation class R is polynomially predictable if there exists
some representation class R' with a uniform polynomial-time evaluation procedure (i.e., an
algorithm EVAL as defined in Chapter 2) such that R is PAC-learnable in terms of R'. If
there is such a class R', then there is also a class R" that is strongly polynomially closed
under exception lists, and such that R is PAC-learnable in terms of R". (The concepts of
R" are simply the concepts of R' augmented with finite lists of exceptions. Clearly R" is
strongly polynomially closed under exception lists, and since R" contains all concepts of R', R
is PAC-learnable in terms of R".) Thus, by the analogue of Corollary 3.3.2 just discussed, R
is polynomially predictable if and only if there exists a randomized polynomial-time
(length-based) Occam algorithm for R that can output as its hypotheses the concepts of any class with
a uniform polynomial-time evaluation procedure.

Related Work 

Schapire [70] has proved a significantly stronger compression result in the particular case of 
polynomial predictability in discrete domains. He shows that if a class R over a discrete 
domain is polynomially predictable, then there is a polynomial-time algorithm that, given any
finite sample of a concept in R, outputs a description of the sample that has size at most 
polynomially larger than the smallest possible consistent description. 
Schapire also considers the notion of weak predictability. A class is weakly predictable if
there exists a learning algorithm that will, with high probability, produce a hypothesis that is
correct only slightly more (by an inverse polynomial) than half of the time. He then proves the
surprising result that if a class is weakly predictable then it is also polynomially predictable
under the regular PAC model. Thus by his result above, it follows that in discrete domains a
class is weakly predictable if and only if it has a randomized polynomial-time Occam algorithm.
By combining our observations above with his result that a class is predictable if and only if
it is weakly predictable, we can obtain the same result for continuous domains as well. These
results demonstrate a relationship between two seemingly quite disparate properties of a class of
representations: if there is an algorithm that can learn a hypothesis that is only slightly better
than random guessing, then there exists another algorithm that can find small hypotheses
exactly consistent with finite samples from the class.

The notion of an Occam algorithm can be relaxed to that of an approximate Occam
algorithm. More formally, define a randomized polynomial-time (length-based) approximate Occam
algorithm (RPTLBAOA, pronounced "reptile-boa") for a representation class $R = (R, \Gamma, c, \Sigma)$
to be a randomized algorithm that, when given a finite sample M of some representation $r \in R^{[s]}$
and parameters $\epsilon, \gamma < 1$, and s, outputs in time polynomial in n, s, |M|, $1/\epsilon$, and $1/\gamma$ a
representation $r' \in R$ such that with probability at least $1 - \gamma$, r' is consistent with at least $(1 - \epsilon)m$ of
the examples of M, and such that $|r'| \leq p_O(n, s, 1/\epsilon, 1/\gamma)\, m^{\alpha}$, where m is the cardinality of M, $p_O$
is some fixed polynomial, and $\alpha < 1$ is some fixed constant. Thus a RPTLBAOA is identical
to a length-based Occam algorithm, except that rather than finding a consistent hypothesis,
the algorithm is allowed to find a hypothesis that is approximately consistent; the hypothesis
may err on a fraction $\epsilon$ of the sample. Implicit in [45] is a proof of the following generalization of
Theorem 3.1.2: If a class of concepts R has a RPTLBAOA, then the class is PAC-learnable. It
is a straightforward observation that the converse holds, i.e., that if a class is PAC-learnable,
then it has a RPTLBAOA. This converse holds regardless of whether the class is closed under
exception lists. Thus, the results of [45] implicitly show that PAC-learnability is equivalent to
the ability to find small approximately consistent hypotheses for a sample in random polynomial
time.

Another result concerning data compression and PAC-learning is due to Sloan [71]. Sloan's
result demonstrates that regardless of the class, PAC-learnability implies the ability to find
exactly consistent hypotheses from the same class that are slightly compressed. In particular,
he shows that if a class is PAC-learnable, then there is a constant k and an algorithm O such
that for sufficiently large m and n, if O is given as input any sample of cardinality m of examples
of length at most n, then O will output with very high probability a hypothesis that is consistent
with the sample and that has size at most $(1 - n^{-k})m$. This slight compression does not appear
to be enough to guarantee PAC-learnability, whereas the compression by more than a linear
amount that is guaranteed by Occam algorithms (and by Schapire's result) is sufficient.

It is interesting to note the similarity between samples of concepts from classes for which 
there exist length-based Occam algorithms and strings with low Kolmogorov complexity ([52]; 
see [2, 53] for more recent results). For each there exists a short algorithm that encodes the
information contained in a longer string of characters. 

Other Implications 

Suppose that, for some class of representations R that is closed under exception lists, there is 
an algorithm L that is a learning algorithm for R provided that the probability distribution 
assigns nonzero probability only to a finite number of strings in the domain, and assigns the 
same nonzero probability to each such string. Note that the construction of the algorithm O 
in Section 3.3 only requires that the learning algorithm work for uniform distributions over
finite samples. Thus the existence of L is sufficient to construct a randomized polynomial-time
Occam algorithm for R. This in turn implies that R is PAC-learnable. Hence for many natural
classes, in order for the class to be learnable under arbitrary probability distributions over the
entire domain (PAC-learnable) it is only necessary that the class be learnable under uniform
distributions over finite subsets of the domain. This observation is due to Manfred Warmuth. 

Consider classes of representations $R = (R, \Gamma, c, \Sigma)$, not necessarily closed under exception
lists, with the following property: $R = \bigcup_n R_n$, where each $r \in R_n$ is defined over examples
of length n only. (Representation classes of Boolean formulas typically have this structure.)
Suppose further that there exists a polynomial p such that for all n and all $r \in R_n$, $|r| \leq p(n)$.
We say that such a class is polynomially size-bounded. A number of restricted classes of
representations are polynomially size-bounded, including k-decision-lists, k-term DNF formulas,
k-clause CNF formulas, kDNF formulas, and kCNF formulas, where k is any constant. (General
DNF formulas are not polynomially size-bounded.) For any polynomially size-bounded class
R, if R is PAC-learnable then the size of the hypothesis output by the learning algorithm
L is always bounded by p(n). Thus for any finite sample M of m examples, if L is given as
input examples from M, drawn randomly according to a uniform distribution, and the accuracy
parameter $\epsilon$ of L is set to a value less than 1/m, then with probability at least $1 - \delta$ L will output a
hypothesis of size polynomial in n that is consistent with M. This is a randomized
polynomial-time (length-based) Occam algorithm for R. Thus for any polynomially size-bounded class,
even if it is not closed under exception lists, learnability is equivalent to the existence of a
randomized (length-based) Occam algorithm.
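The argument for polynomially size-bounded classes can be sketched as a small wrapper around a PAC learner. The sketch below assumes monomials as the class and uses the standard elimination learner as a stand-in for the algorithm L; the function names, the encoding of a literal as a pair `(i, b)` meaning "x_i = b", and the sample-size formula used inside `learn_monomial` are assumptions made for illustration. The essential step is in `occam_from_pac`: the learner only sees uniform draws from the sample, and its accuracy parameter is set below 1/m, so a hypothesis meeting that accuracy cannot err on any of the m sample points.

```python
import math
import random


def learn_monomial(draw, n, epsilon, delta):
    """Elimination learner: keep every literal never contradicted by a positive example."""
    literals = {(i, b) for i in range(n) for b in (0, 1)}  # (i, b) stands for "x_i = b"
    draws = math.ceil((2 * n / epsilon) * (math.log(2 * n) + math.log(1 / delta)))
    for _ in range(draws):
        x, label = draw()
        if label:
            literals -= {(i, 1 - x[i]) for i in range(n)}
    return frozenset(literals)


def satisfies(literals, x):
    """Evaluate the monomial given by `literals` on the assignment x."""
    return all(x[i] == b for i, b in literals)


def occam_from_pac(learner, sample, n, delta, rng):
    """Run a PAC learner on the uniform distribution over `sample` with epsilon < 1/m."""
    m = len(sample)
    epsilon = 1.0 / (m + 1)
    return learner(lambda: rng.choice(sample), n, epsilon, delta)
```

The returned monomial has at most 2n literals regardless of m, so its size is bounded by a polynomial in n alone, as required of a length-based Occam algorithm.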

Currently, all results known of the form "R is not PAC-learnable unless RP = NP" rely on
certain syntactic restrictions on the class R [62]; such results rely on a proof that it is NP-hard
to determine whether there exists any hypothesis from the class R that is consistent with a
given finite sample. This technique cannot be applied to show the intractability of learning any
class that is syntactically rich enough to allow the expression of disjunctions of singletons, since
in this case a consistent hypothesis for any sample is easily obtained. Consequently, the
non-PAC-learnability of DNF or DFAs cannot be proved in this manner. Our results may provide
a new technique for proving nonlearnability results that rely only on the assumption that RP
$\neq$ NP.³ For example, in Section 3.5 we showed that the class of DFAs is PAC-learnable if and
only if it admits a length-based Occam algorithm. Such an Occam algorithm would provide a
very weak approximate solution to the minimum consistent DFA problem; partially negative
results in this regard have been demonstrated [64] which, if extended appropriately, would show
that no Occam algorithm for DFAs is possible, and consequently no PAC-learning algorithm
for DFAs is possible unless RP = NP.

An obvious open problem is to determine whether Theorems 3.3.1 and 3.4.7 can be proved 
using weaker conditions than closure under exception lists. The exception list property is 
satisfied by any class that (1) contains all singleton concepts, and (2) is (polynomially) closed 
under set union and subtraction. It would be of interest to determine if either of these conditions 
can be dropped. In particular, classes such as DNF admit union (via disjunction) but do not
appear to admit set difference; thus they do not appear to be closed under exception lists. Is
the PAC-learnability of DNF equivalent to the existence of an Occam algorithm for DNF?

³ In the case of DFAs, Kearns and Valiant [48] (see also [44]) show nonpredictability based on number-theoretic
and cryptographic assumptions that are stronger than the widespread complexity-theoretic assumption
that the classes RP and NP are different.
4 SEMI-SUPERVISED LEARNING



In this chapter we ask whether it is possible to learn with less information than is provided in 
the standard PAC-learning model - without a teacher labeling examples of each concept to be 
learned as positive or negative. Further, we consider the problem of simultaneously learning a 
collection of concepts, instead of just a single one. 

There are (at least) two situations that we might wish to model that involve learning in an 
environment with no teacher and many concepts to be learned. One is to assume that there are 
no a priori underlying concepts against which the learner is to be evaluated, and that the goal 
is to partition the examples in a manner consistent with some predetermined criterion. This 
approach is traditionally known as clustering, or unsupervised learning, and has been studied 
extensively [3, 17, 24, 34, 69]. 

The other approach, which is undertaken in this chapter, is to assume that there are in 
fact specific concepts to be learned, yet there is no teacher labeling each element as to its 
concept membership. In this case, the criterion of success is how well the learned concepts
approximate the correct underlying concepts. Of course, in the absence of any information about 
the underlying concepts, and without a predetermined criterion for measuring the suitability 
of a clustering, the learning task is impossible. If, on the other hand, there is a teacher who 
labels each element with its corresponding concept name, then (for any reasonable definition 
of concept learning) the simultaneous (supervised) learning of a disjoint collection of concepts
trivially reduces to separate instances of learning individual concepts. For each concept, the 
positive examples will then be the members of the concept, and the negative examples will be 
members of the other concepts. 

We strike a compromise between these two extremes, and investigate the simultaneous 
learnability of a collection of concepts in a semi-supervised manner, i.e., with partial information.
Rather than assuming that concept labels are given, we assume instead that there is an oracle
that, upon request, will randomly and independently choose two examples according to an
unknown probability distribution over the domain and tell the learner whether or not the two
examples belong to the same concept. A possible interpretation or justification of such an oracle
is a learning environment in which the learner is able to occasionally and randomly notice that
two examples ought to be classified together (or apart), yet does not necessarily have the ability 
to relate these two examples to other examples previously seen, or likely to be seen. 

If there is only one concept to be learned, then the problem is closely related to a form 
of concept learning in which the teacher, rather than providing randomly chosen positive and 
negative examples, instead answers whether two randomly chosen examples are of the same type, 
i.e., both positive or both negative, without telling which is the case. Thus the learnability of 
a single concept in a semi-supervised manner is an interesting question itself, as it explores the 
boundary of the amount of information that is necessary for concept learning. It would seem 
that if, in addition, the examples were from many different concepts to be learned simultaneously 
in a semi-supervised manner, then the learning problem would be significantly more difficult. 

We show that in fact for a wide range of representation classes R (in which concepts are
represented by Boolean formulas) known to be PAC-learnable, and for every constant t > 0,
any collection of t disjoint concepts defined by formulas of R is learnable in a semi-supervised
manner (ss-learnable) in polynomial time.

Sufficient conditions are also given for the ss-learnability of representation classes of finite
Vapnik-Chervonenkis dimension. In particular, it is shown that if R has finite VC-dimension
and R is learnable from positive examples only, then any collection of t disjoint concepts from
R can be ss-learned in time polynomial in t.

Of particular interest is a new technique of learning an intermediate oracle. Many repre- 
sentation classes would be ss-learnable if we were to assume the existence of an oracle that, 
when asked about two examples, tells us whether or not they are examples of the same concept. 
We do not, however, wish to assume the availability of such an oracle. Since we have access 
to pairs of examples labeled as to whether or not they are in the same concept, in many cases 
we can use these examples to learn a concept description that will imitate the desired oracle 
quite accurately. We call this concept description an intermediate oracle. Once learned, the 
intermediate oracle can be used in place of a "real" oracle. We expect that this technique will 
prove useful for other learning problems.

The rest of this chapter is organized as follows. In Section 4.1 we review the necessary 
background and define ss-learnability. Section 4.2 gives an algorithm for polynomial-time
ss-learning of monomial concept classes. Section 4.3 gives sufficient conditions for ss-learnability
which, in Section 4.4, are used to prove the ss-learnability of other classes of Boolean formulas.
In Section 4.5 the ss-learnability of a concept class is related to the VC-dimension of the class 
and additional sufficient conditions are given. In Section 4.6 an ostensibly different definition 
of ss-learning is given and shown to be equivalent to that of ss-learnability. 

4.1 Notation and Definitions 

For each $n \geq 1$, let $X_n = \{x_1, x_2, \ldots, x_n\}$ be a set of n Boolean variables. Define a
family of Boolean formulas to be a representation class $F = (F, \Gamma, c, \Sigma)$ where $\Sigma = \{0, 1\}$,
$\Gamma = \{\wedge, \vee, \neg, (, ), x_1, x_2, \ldots\}$, and F is a set of Boolean formulas described by strings of characters
in $\Gamma$ in the obvious way. For any $n \geq 1$, let $\Gamma_n \subseteq \Gamma$ denote the set $\{\wedge, \vee, \neg, (, ), x_1, x_2, \ldots, x_n\}$
and $F_n \subseteq F$ denote the set of formulas in F that contain only symbols in $\Gamma_n$. For each $f \in F$,
f represents the concept $c(f) = \{x \in \{0, 1\}^* : f(x) = 1\}$.¹

As mentioned in Chapter 2, we occasionally write f in place of c(f) when the meaning
is clear from context. Similarly, the word "formula" is sometimes used to denote "concept 
represented by the formula". 

We now naturally extend the definition of PAC-learning to the ss-learning of t disjoint
concepts. Let $t \in \mathbb{N}$. Let $F = (F, \Gamma, c, \Sigma)$ be a family of Boolean formulas, and for some n
let $f_1, f_2, \ldots, f_t \in F_n$ be pairwise disjoint (i.e., the sets of satisfying assignments of the $f_i$'s are
disjoint). Let D be any probability distribution on $\{0, 1\}^n$ such that

$$D\left(\bigvee_{i=1}^{t} f_i\right) = 1. \qquad (4.1)$$

Thus the only elements that may occur when sampling from D are those which satisfy one (and
hence exactly one) of the $f_i$'s.

Let LABELED-PAIRS$_{D, f_1, f_2, \ldots, f_t}$ be an oracle that, when called, randomly and
independently chooses two elements $x, y \in \{0, 1\}^n$ according to the probability distribution D and
returns (x, y, same) if, for some i, x and y both satisfy $f_i$, or returns (x, y, different) if x and y
satisfy different formulas in $\{f_1, f_2, \ldots, f_t\}$. When D and $f_1, f_2, \ldots, f_t$ are clear from context,
we omit the subscripts and write only LABELED-PAIRS.

¹ Thus c(f) contains strings of length n or more. In particular, for any string x of length n in c(f), all strings
that have x as a prefix are also in c(f). We will restrict the probability distribution D in the definition of
semi-supervised learning so that D(x) = 0 for all x not of length exactly n; thus in any particular learning problem
only strings of length n are considered.
An alternative definition would permit D to generate examples that are not in any of the
$f_i$'s. However, it is then more difficult to find a natural definition of LABELED-PAIRS. (It is
not clear how a pair (x, y) should be labeled if one or both elements are not in any concept $f_i$.)
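The behavior of the LABELED-PAIRS oracle can be made concrete with a small simulation. This is only a sketch under stated assumptions: the distribution D is given as a finite list of support points with weights, formulas are Python predicates on tuples of bits, and the names `make_labeled_pairs`, `concept_index`, and `labeled_pairs` are hypothetical. The check in the constructor enforces equation (4.1) by rejecting any support point that does not satisfy exactly one formula.

```python
import random


def make_labeled_pairs(formulas, support, weights, rng):
    """Build a LABELED-PAIRS oracle for pairwise disjoint `formulas` and a finite D."""

    def concept_index(x):
        hits = [i for i, f in enumerate(formulas) if f(x)]
        if len(hits) != 1:
            raise ValueError("every point of the support must satisfy exactly one formula")
        return hits[0]

    for x in support:
        concept_index(x)  # equation (4.1): D puts all of its mass on the union of the f_i

    def labeled_pairs():
        # Two independent draws from D (sampling with replacement).
        x, y = rng.choices(support, weights=weights, k=2)
        label = "same" if concept_index(x) == concept_index(y) else "different"
        return x, y, label

    return labeled_pairs
```

Each call to the returned function yields one triple, for example `(x, y, "same")`. The learner never sees which formula a point satisfies, only whether the two points of a pair agree.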

In order to define the learning of a collection of t formulas $\mathcal{F} = \{f_1, f_2, \ldots, f_t\}$ in a
semi-supervised manner we need to measure the error of a collection of u formulas $\mathcal{G} = \{g_1, g_2, \ldots, g_u\}$
with respect to $\mathcal{F}$ and a given distribution D. The definition is obtained by the following
intuitive considerations: Ideally, u = t and there is a correspondence between elements of
$\mathcal{G}$ and $\mathcal{F}$ such that each $g_i$ is an approximation of some unique formula $f_j$. However, it is
conceivable that more or fewer than the true number of formulas are learned, and thus there
may be no correspondence between some of the formulas in $\mathcal{G}$ and some of the formulas in $\mathcal{F}$.
Let $\mathcal{G}' \subseteq \mathcal{G}$ be the set of those formulas in $\mathcal{G}$ for which there is in fact a corresponding formula
in $\mathcal{F}$. We measure the error of $\mathcal{G}$ in the following way: A string $x \in \{0, 1\}^n$ is an error point if
any of the following conditions hold.

Error-1 x is not in any $g \in \mathcal{G}'$. The intention is that $\mathcal{G}'$ contains the relevant learned formulas,
that is, those corresponding to the underlying formulas in $\mathcal{F}$. Thus any point falling outside
of the region $\bigvee \mathcal{G}'$ should be counted towards the error.

Error-2 x is in the symmetric difference of some $g_i \in \mathcal{G}'$ and the corresponding $f_j$. This counts
as error any discrepancy between a learned formula and the corresponding underlying
formula that it is intended to approximate.

Error-3 x is in the intersection of two different concepts in $\mathcal{G}$. This prohibits excessive overlap
among the learned formulas.

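To formally restate the above regions of error, let $I : \mathcal{G}' \to \mathcal{F}$ be an injection mapping
learned formulas in $\mathcal{G}'$ to their corresponding underlying target formulas in $\mathcal{F}$. Then

• $E_1 = \overline{\bigcup_{g \in \mathcal{G}'} c(g)}$

• $E_2 = \bigcup_{g \in \mathcal{G}'} \left( c(g) \oplus c(I(g)) \right)$

• $E_3 = \bigcup_{g, h \in \mathcal{G},\ g \neq h} \left( c(g) \cap c(h) \right)$

Note that these regions might more accurately be denoted by $E_1^I$, $E_2^I$, and $E_3^I$, since they
depend on the injection I. In order to simplify notation the superscripts are omitted.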
Now we may define how closely a finite collection $\mathcal{G}$ of formulas approximates a finite
collection $\mathcal{F}$ of formulas.

Definition 4.1.1 Given $\mathcal{F} = \{f_1, f_2, \ldots, f_t\}$, $\mathcal{G} = \{g_1, g_2, \ldots, g_u\}$, and a distribution D on
$\{0, 1\}^n$ satisfying equation (4.1), then $\mathcal{G}$ is $\epsilon$-close to $\mathcal{F}$ if there exists a subset $\mathcal{G}' \subseteq \mathcal{G}$ and an
injection $I : \mathcal{G}' \to \mathcal{F}$ such that

$$D(E_1 \cup E_2 \cup E_3) < \epsilon.$$

The motivation for counting the regions $E_1$ and $E_2$ as error associated with $\mathcal{G}$ should
be clear. The purpose of region $E_3$ is to preclude the possibility of making $\mathcal{G}$ so large that
the difficulty in the learning problem becomes determining the subset $\mathcal{G}'$ that has the desired
properties. For example, if the third error component were omitted, any family F of formulas
could be ss-learned (as defined below) by simply choosing $\mathcal{G}$ so that it contained formulas
representing every possible concept over n variables. Limiting the amount of overlap among
the concepts of $\mathcal{G}$ appears to be the best of a number of possible solutions to this problem.
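For small instances, the three error conditions and Definition 4.1.1 can be checked by brute force. The sketch below assumes that D is a dictionary from points to exact probabilities, that formulas are Python predicates, and that the sub-collection and injection are encoded together as a dictionary from indices of learned formulas to distinct indices of target formulas; the names `error_mass` and `is_eps_close` are hypothetical. The search over all partial injections is exponential and is meant only to make the existential quantifier in the definition explicit.

```python
from fractions import Fraction
from itertools import combinations, permutations


def error_mass(D, F, G, injection):
    """D-mass of the error points when G' is the set of keys of `injection`."""
    total = Fraction(0)
    for x, weight in D.items():
        chosen = list(injection)
        e1 = not any(G[i](x) for i in chosen)                    # Error-1
        e2 = any(G[i](x) != F[injection[i]](x) for i in chosen)  # Error-2
        e3 = sum(1 for g in G if g(x)) >= 2                      # Error-3
        if e1 or e2 or e3:
            total += weight
    return total


def is_eps_close(D, F, G, eps):
    """Search every sub-collection G' and injection I for error mass below eps."""
    for k in range(min(len(F), len(G)) + 1):
        for g_idx in combinations(range(len(G)), k):
            for f_idx in permutations(range(len(F)), k):
                if error_mass(D, F, G, dict(zip(g_idx, f_idx))) < eps:
                    return True
    return False
```

Because the third condition ranges over all of the learned collection and not only over the chosen sub-collection, adding a redundant overlapping formula to the output can only increase the error, which is the effect described in the preceding paragraph.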


Finally, the following definition of ss-learnability essentially parallels that of PAC-learnability,
except that the information available to an ss-learning algorithm consists of LABELED-PAIRS,
and the algorithm is required to output a collection of formulas that is $\epsilon$-close to the
underlying collection of formulas. For any collection of formulas $\mathcal{F} = \{f_1, f_2, \ldots, f_t\}$, define
maxsize($\mathcal{F}$) = $|f_i|$ such that for all $j \leq t$, $|f_i| \geq |f_j|$. (Thus maxsize($\mathcal{F}$) is the size of the
largest formula in $\mathcal{F}$.)

Definition 4.1.2 The family $F = (F, \Gamma, c, \Sigma)$ of Boolean formulas is ss-learnable (learnable in
a semi-supervised manner) if for each $t \in \mathbb{N}$ there exists an algorithm A and polynomial p such
that for all $n, s \geq 1$, for every disjoint collection $\mathcal{F} = \{f_1, f_2, \ldots, f_t\} \subseteq F_n$ with maxsize($\mathcal{F}$) $\leq s$,
for any probability distribution D on $\{0, 1\}^n$ satisfying equation (4.1), and for all $\epsilon, \delta > 0$, if A is
given as input the parameters $\epsilon$, $\delta$, and s and may access the oracle LABELED-PAIRS$_{D, f_1, f_2, \ldots, f_t}$,
then in time $p(n, s, 1/\epsilon, 1/\delta)$ A outputs a collection $\{g_1, g_2, \ldots, g_u\} \subseteq F_n$ that, with probability at
least $1 - \delta$, is $\epsilon$-close to $\{f_1, f_2, \ldots, f_t\}$. Such an algorithm A is an ss-learning algorithm for the
class F.

Note that this definition does not require that the learned formulas be disjoint, even though
the formulas that they approximate are disjoint. However, any area of overlap among the
learned formulas contributes towards the allowable error (region $E_3$). Similarly, the union of
the learned formulas is not required to exhaust the instance space, but any region that is in
one of the original formulas but in none of the learned formulas is counted towards the error
(region $E_1$).

Note also that we allow the algorithm to depend on the number of formulas t to be learned. 
In particular, the run time need not be polynomial in t. Of course, it would be preferable
to have an algorithm that always runs in time polynomial in t, but we have not been able 
to extend our results in this manner. This is similar to the problem of PAC-learning where 
for many concept classes (e.g. DNF), although it is not known whether the class as a whole 
is PAC-learnable, positive learnability results have been found for subclasses in which some 
measure of the concept size is assumed to be bounded by a constant (e.g. kDNF).

4.2 Semi-supervised Learning of Monomials 

In this section we assume that the formulas to be learned are monomials $m_1, m_2, \ldots, m_t$ over
n variables. We show that, given access to randomly generated pairs of strings from concepts
defined by monomials, labeled only as to whether or not they are members of the same
concept, we can learn monomials that accurately describe the concepts in polynomial time. In
Sections 4.3 and 4.4 we generalize the techniques to ss-learn other collections of formulas.

Theorem 4.2.1 Let F be the family of monomials. Then F is ss-learnable. 

The general idea of the proof of Theorem 4.2.1 is as follows. Suppose we have an oracle that
can tell us, for any pair of n-bit strings x and y, whether x and y are in the same concept, i.e.,
satisfy the same monomial. Then we can learn a collection Q of monomials that satisfies the
definition of ss-learnability. This can be done by collecting enough individual points, say m of
them (obtained by m/2 calls to LABELED-PAIRS), to ensure that, with high probability, we
have at least one representative from each monomial of significant weight with respect to the
distribution. Then, using our assumed oracle, we can query each of the $\binom{m}{2}$ pairs of these points
to see which of them are in the same concept, and use the results to build equivalence classes.
We can then use the m points as examples (positive or negative, depending on the particular
concept) with which to learn the monomials defining membership in the particular concepts.
Thus m must also be large enough so that, with high probability, the learned monomials are
sufficiently accurate.
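
The grouping step described above can be sketched in Python. This is only an illustration under stated assumptions: the monomials in `MONOMIALS`, the exact oracle `exact_oracle`, and the helper names are hypothetical and do not come from the thesis, and a union-find structure is simply one convenient way to turn pairwise answers into equivalence classes.

```python
from itertools import combinations

def partition_by_oracle(points, same_concept):
    """Group points into classes by querying a pairwise same-concept oracle."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the class representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Query each of the (m choose 2) pairs once and merge classes on "same".
    for i, j in combinations(range(len(points)), 2):
        if same_concept(points[i], points[j]):
            parent[find(i)] = find(j)

    classes = {}
    for i, p in enumerate(points):
        classes.setdefault(find(i), []).append(p)
    return list(classes.values())

def satisfies(monomial, x):
    """A monomial is a dict {variable index: required bit}."""
    return all(x[i] == b for i, b in monomial.items())

# Hypothetical disjoint monomials over 3 variables: x0, and (not x0) and x1.
MONOMIALS = [{0: 1}, {0: 0, 1: 1}]

def exact_oracle(x, y):
    """Assumed perfect oracle: True iff x and y satisfy the same monomial."""
    return any(satisfies(m, x) and satisfies(m, y) for m in MONOMIALS)

sample = [(1, 0, 0), (0, 1, 1), (1, 1, 0), (0, 1, 0)]
classes = partition_by_oracle(sample, exact_oracle)
```

Each resulting class can then serve as the set of positive examples for one monomial, with the remaining sample points treated as negative examples.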

We don't have such an oracle; what we do have is LABELED-PAIRS, which can give us
same and different labels, but only for randomly generated pairs of points, not for requested
pairs. We cannot just wait for the pairs that we're interested in to be generated, since D may
be such that this would take exponential time. In order to get around this problem we learn an
oracle from the examples of sameness and differentness supplied by LABELED-PAIRS. Again,
it might take too long to learn an oracle that responds correctly on all possible inputs; we 
instead learn an approximate oracle, and guarantee that the approximate oracle is accurate 
enough so that, with high probability, it will be correct on each pair of points in our sample 
of size m. (We call such an oracle an intermediate oracle, as it is not supplied to the learning 
algorithm; instead, it is constructed and used by the learning algorithm as an oracle enabling 
a solution of the proper form to be discovered.) 

Definition 4.2.2 For any collection of monomials $m_1, \ldots, m_t$ over the variable set $X_n$ (i.e.
monomials in $F_n$), the concept SameConcept$(\langle m_1, \ldots, m_t \rangle) \subseteq \{0,1\}^{2n}$ is the set of all binary
strings of length 2n such that the first n-bit substring and the second n-bit substring are in
the same monomial concept; i.e., if x and y are n-bit strings and their concatenation $xy \in$
SameConcept$(\langle m_1, \ldots, m_t \rangle)$, then for some i, $1 \le i \le t$, $x \in m_i$ and $y \in m_i$. When clear from
context, we omit the argument $\langle m_1, \ldots, m_t \rangle$ and refer to the concept as SameConcept.

Note that the oracle LABELED-PAIRS$_{D,m_1,\ldots,m_t}$ is a generator of positive and negative
examples for the concept SameConcept$(\langle m_1, \ldots, m_t \rangle)$ (with product distribution $D^2 : \Sigma^{2n} \to \Re$
defined by $D^2(z) = D(x)D(y)$, where $z = xy$).

Define the set of concepts $C_S = \bigcup_n \{$SameConcept$(\langle m_1, \ldots, m_t \rangle) : m_1, \ldots, m_t$ are monomials
over the variable set $X_n\}$. Consider the concept class $(C_S, \Sigma^*)$.

Lemma 4.2.3 There is a representation class S for $(C_S, \Sigma^*)$ that is PAC-learnable in terms
of tCNF.

Proof: Let x and y be n-bit binary strings, and let z be their concatenation, xy. Suppose
that $z \in$ SameConcept$(\langle m_1, \ldots, m_t \rangle)$. Then $x, y \in m_i$ for some $i \le t$. For each $i \le t$, let the
monomial (over 2n bits) $m'_i$ be defined such that for all $j \le n$,

$(x_j \in m'_i) \wedge (x_{n+j} \in m'_i) \iff x_j \in m_i$

and

$(\bar{x}_j \in m'_i) \wedge (\bar{x}_{n+j} \in m'_i) \iff \bar{x}_j \in m_i.$

Thus z must satisfy $m'_i$. In fact, the strings that satisfy $m'_i$ are exactly those strings that are
concatenations of pairs of strings that both satisfy $m_i$. Thus the set of pairs of strings that are
in the same concept is the set of strings whose concatenations satisfy the t-term DNF expression
$m'_1 \vee m'_2 \vee \cdots \vee m'_t$. This means that the pairs of points that are in the same (different) concept
are the pairs whose concatenations are positive (negative) examples of a t-term DNF expression
over 2n variables. Let S be the representation class of t-term DNF formulas. Then any concept
in $C_S$ can be represented as a formula of S. Although for each $t \ge 2$ it is NP-hard to learn an
$\epsilon$-accurate t-term DNF expression from examples [61], t-term DNF is PAC-learnable in terms
of tCNF (Theorem 2.0.2). Hence S is learnable in terms of tCNF. □
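
The construction in this proof can be checked by brute force on a small instance. The sketch below is illustrative only: the dictionary encoding of monomials, the helper names, and the two example monomials are assumptions made for the example, not notation from the thesis. It builds each doubled monomial over 2n bits and confirms that the resulting t-term DNF accepts exactly the concatenations of pairs lying in the same concept.

```python
from itertools import product

def satisfies(monomial, bits):
    """A monomial is a dict {variable index: required bit}."""
    return all(bits[i] == b for i, b in monomial.items())

def double(monomial, n):
    """Build m' over 2n bits: each literal on x_j is repeated on x_{n+j}."""
    doubled = dict(monomial)
    doubled.update({n + j: b for j, b in monomial.items()})
    return doubled

def same_concept_dnf(monomials, n):
    """Return the t-term DNF m'_1 v ... v m'_t as a predicate on 2n-bit strings."""
    terms = [double(m, n) for m in monomials]
    return lambda z: any(satisfies(term, z) for term in terms)

n = 3
# Hypothetical disjoint monomials: x0, and (not x0) and x1.
monomials = [{0: 1}, {0: 0, 1: 1}]
dnf = same_concept_dnf(monomials, n)

def same_concept_direct(x, y):
    """Direct definition: x and y satisfy the same monomial."""
    return any(satisfies(m, x) and satisfies(m, y) for m in monomials)

# Compare the DNF with the direct definition on every pair of n-bit strings.
cube = list(product((0, 1), repeat=n))
agree = all(dnf(x + y) == same_concept_direct(x, y) for x in cube for y in cube)
```

The check is exhaustive only because n is tiny; it illustrates the equivalence and says nothing about learnability.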

Thus we can obtain (with probability at least $1 - \delta$) a tCNF expression that is $\epsilon$-accurate
for $m'_1 \vee \cdots \vee m'_t$ in time polynomial in n, $\frac{1}{\epsilon}$, and $\frac{1}{\delta}$. The tCNF expression will be used as an
oracle SC for the concept SameConcept in the learning algorithm given below.

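Definition 4.2.4 For any q, $0 < q < 1$, a q-significant concept (or monomial) $m_i \in \{m_1, \ldots, m_t\}$
is one for which $D(m_i) > q$. Elements of $\{m_1, \ldots, m_t\}$ that are not q-significant are
q-insignificant.

Note that

$D\Big(\bigcup \{m_i : m_i \text{ is } \epsilon/2t\text{-insignificant}\}\Big) \le \sum_{m_i \text{ is } \epsilon/2t\text{-insignificant}} D(m_i) \le \sum_{i=1}^{t} \frac{\epsilon}{2t} = \frac{\epsilon}{2}.$    (4.2)

The ss-learning algorithm is sketched in Figure 4.1.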

Theorem 4.2.5 Let the variable m in the Monomial ss-Learning Algorithm be such that

$m = \max\Big\{\frac{2t}{\epsilon} \ln \frac{4t}{\delta},\; p_2\big(n, \frac{2t}{\epsilon}, \frac{4t}{\delta}\big)\Big\},$

where $p_2$ is the (polynomial) time bound on the algorithm for PAC-learning a monomial of n
variables with accuracy parameter $\frac{\epsilon}{2t}$ and confidence parameter $\frac{\delta}{4t}$. Then the Monomial
ss-Learning Algorithm is an ss-learning algorithm for the class of monomials.

Monomial ss-Learning Algorithm

1. Use LABELED-PAIRS to learn, with probability at least
$1 - \frac{\delta}{4}$, an oracle SC for SameConcept that is $\frac{\delta}{2m^2}$-accurate. A
learning algorithm exists by Lemma 4.2.3. The time taken is
at most $p_1(2n, \frac{2m^2}{\delta}, \frac{4}{\delta})$, where $p_1(n, \frac{1}{\epsilon}, \frac{1}{\delta})$ is the (polynomial)
time needed to learn a tCNF expression of n variables with
accuracy $\epsilon$ and confidence $\delta$.

2. Make $\frac{m}{2}$ calls to LABELED-PAIRS to obtain, with probability
at least $1 - \frac{\delta}{4}$, an m-sample (a sample of size m) containing
at least one element from each $\frac{\epsilon}{2t}$-significant monomial.

3. Use SC to divide the m-sample into equivalence classes in
the obvious way.

4. For each equivalence class, label the m-sample accordingly
and run an algorithm for PAC-learning monomials, with accuracy
and confidence parameters $\frac{\epsilon}{2t}$ and $\frac{\delta}{4t}$ respectively.
Whenever the algorithm requests an example, give it the first
unused example from the m-sample. Each monomial output
by the learning algorithm is output as a concept description.

Figure 4.1: Algorithm for ss-learning monomials
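
Step 4 of the figure can be illustrated with one standard monomial learner, the elimination procedure that keeps every literal consistent with the positive examples seen so far. The thesis does not say which PAC-learning algorithm for monomials is used, so the choice of learner, the dictionary encoding, and the example equivalence classes below are assumptions made for this sketch.

```python
def learn_monomial(positives, n):
    """Elimination learner: keep the literals that every positive example satisfies.

    Returns a monomial as a dict {variable index: required bit}.
    """
    if not positives:
        raise ValueError("need at least one positive example")
    # Start with all 2n literals (x_j and its negation for each j).
    literals = {(j, b) for j in range(n) for b in (0, 1)}
    for x in positives:
        # Discard each literal that this positive example falsifies.
        literals -= {(j, 1 - x[j]) for j in range(n)}
    return dict(literals)

# Hypothetical equivalence classes produced by step 3 on a 4-point sample.
classes = [[(1, 0, 0), (1, 1, 0)], [(0, 1, 1), (0, 1, 0)]]
learned = [learn_monomial(c, 3) for c in classes]
```

With so few examples the first learned monomial still carries an extra literal on the third variable; the accuracy guarantee of the theorem relies on the sample being large enough, not on exact recovery.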



Clearly Theorem 4.2.1 follows from Theorem 4.2.5. To prove Theorem 4.2.5 we need the 
following definition and lemmas. 

Definition 4.2.6 A good run of the Monomial ss-Learning Algorithm is one in which

(A) The oracle SC learned in step 1 is in fact a $\frac{\delta}{2m^2}$-approximation of SameConcept;

(B) The m-sample obtained in step 2 does have at least one element of every $\frac{\epsilon}{2t}$-significant
monomial in $\{m_1, m_2, \ldots, m_t\}$;

(C) In step 3, SC makes no mistakes in dividing the particular m-sample into equivalence
classes; and

(D) Each learned monomial concept from step 4 of the algorithm is in fact an $\frac{\epsilon}{2t}$-approximation
of the (real) monomial that covers the elements of the equivalence class labeled as positive
examples.

Lemma 4.2.7 Let $M = \{m_1, m_2, \ldots, m_t\}$ be the collection of disjoint monomials to be learned.
If a good run of the Monomial ss-Learning Algorithm occurs, and Q is the set of monomials
produced, then Q is $\epsilon$-close to M.

Proof: Set Q' (as described in Definition 4.1.1) equal to Q. Let I be the injection mapping
each learned monomial $g \in Q'$ to the (real) monomial of M that is consistent with the labeling
of the m-sample that was used to learn g. (By (C), there is exactly one such monomial.) Since
Q' = Q, and since each z such that $D(z) > 0$ is in exactly one $m_i \in \{m_1, m_2, \ldots, m_t\}$, then for
any i, j such that $g_i$ and $g_j$ are in Q' and $i \ne j$,

$z \in g_i \cap g_j \implies z \in g_i \oplus I(g_i) \;\vee\; z \in g_j \oplus I(g_j).$

Thus

$E3 = \bigcup_{i \ne j} (g_i \cap g_j) \subseteq \bigcup_{g \in Q'} (g \oplus I(g)) = E2,$

so to show that Q' is $\epsilon$-close to M it suffices to show that

$D(E1 \cup E2) \le \epsilon.$

By (D), for each $g \in Q'$, $D(g \oplus I(g)) \le \frac{\epsilon}{2t}$, and since there are at most t elements of Q',

$D(E2) = D\Big(\bigcup_{g \in Q'} g \oplus I(g)\Big) \le t \cdot \frac{\epsilon}{2t} = \frac{\epsilon}{2}.$

To show that Q' is $\epsilon$-close to M, it thus suffices to show that

$D(E1 - E2) \le \frac{\epsilon}{2}.$    (4.3)

By equation (4.2), equation (4.3) is true if

$E1 - E2 \subseteq \bigcup \{m_i : m_i \text{ is } \epsilon/2t\text{-insignificant}\},$    (4.4)

which is true if

$E1 \cap \bigcup \{m_i : m_i \text{ is } \epsilon/2t\text{-significant}\} \subseteq E2.$    (4.5)

To see that containment (4.5) holds, note that if $z \in m_i$ and $m_i$ is $\frac{\epsilon}{2t}$-significant, then by
(B), there is some $g \in Q'$ such that $I(g) = m_i$. If z is also in $E1 = \overline{\bigcup Q'}$, then $z \notin g$, so
$z \in g \oplus I(g) \subseteq E2$. □

Lemma 4.2.8 Let SC be a $\frac{\delta}{2m^2}$-approximation of SameConcept (with respect to the product
measure $D^2$). If we randomly generate m points (from $\frac{m}{2}$ calls to LABELED-PAIRS), then
with probability at least $1 - \frac{\delta}{4}$, SC will correctly classify each of the $\binom{m}{2}$ pairs of points as to
sameness and differentness.

Proof: Let SC be a $\frac{\delta}{2m^2}$-approximation of SameConcept, and thus $\frac{\delta}{2m^2}$-accurate with respect
to the oracle LABELED-PAIRS. Consider any m-sample generated randomly from $\frac{m}{2}$ calls to
LABELED-PAIRS. Number all of the pairs of points in the m-sample from 1 to $\binom{m}{2}$. For
each i from 1 to $\binom{m}{2}$, let $\gamma_i$ be the probability that the oracle SC is wrong on the i-th of
the $\binom{m}{2}$ pairs. Since, for any m-sample, each permutation of the m points is equally likely,
$(\forall i, j \le \binom{m}{2})\ \gamma_i = \gamma_j$. In particular, $(\forall i \le \binom{m}{2})\ \gamma_i = \gamma_1$. Note that $\gamma_1$ is the probability that
a $\frac{\delta}{2m^2}$-accurate oracle SC is wrong on the first pair. Thus the probability that SC is wrong on
some pair is at most

$\sum_i \gamma_i = \binom{m}{2} \gamma_1 \le \binom{m}{2} \frac{\delta}{2m^2} \le \frac{\delta}{4}.$

□

Lemma 4.2.9 If m is as in the hypothesis of Theorem 4.2.5 then the probability of a good run
of the Monomial ss-Learning Algorithm is at least $1 - \delta$.

Proof: Let A, B, C, and D represent the events that (A), (B), (C), and (D) (as given in
Definition 4.2.6) occur respectively. The probability of not obtaining a good run is then

$\Pr(\bar{A}) + \Pr(A \cap \bar{B}) + \Pr(A \cap B \cap \bar{C}) + \Pr(A \cap B \cap C \cap \bar{D}).$

• $\Pr(\bar{A})$ is at most $\frac{\delta}{4}$, by the definition of PAC-learnability.

• By hypothesis, $m \ge \frac{2t}{\epsilon} \ln \frac{4t}{\delta}$, thus the probability that B fails to occur, i.e., that some
$\frac{\epsilon}{2t}$-significant monomial does not have a representative in the m-sample, is at most
$t\,(1 - \frac{\epsilon}{2t})^m \le t\,e^{-\epsilon m / 2t} \le \frac{\delta}{4}$.
Hence $\Pr(A \cap \bar{B}) \le \Pr(\bar{B}) \le \frac{\delta}{4}$.

• By Lemma 4.2.8, the probability that (C) fails to occur given that (A) occurs is at most
$\frac{\delta}{4}$, thus $\Pr(A \cap B \cap \bar{C}) \le \Pr(A \cap \bar{C}) \le \Pr(\bar{C} \mid A) \le \frac{\delta}{4}$.

• Given that A, B, and C occurred, the probability that a particular learned monomial has
error more than $\frac{\epsilon}{2t}$ is at most $\frac{\delta}{4t}$, by the definition of PAC-learnability and the fact that
each time one of the monomial learning algorithms requested an example it was given an
example from the m-sample that was generated randomly according to D. Because m
is at least as large as the time bound on the monomial learning algorithm, no algorithm
requested more than m examples. Since there were at most t monomial equivalence
classes obtained from the m-sample, the probability that (D) fails to occur, i.e., that
some learned monomial has error more than $\frac{\epsilon}{2t}$, is at most $\frac{\delta}{4}$. Thus
$\Pr(A \cap B \cap C \cap \bar{D}) \le \Pr(\bar{D} \mid A \wedge B \wedge C) \le \frac{\delta}{4}$.

Thus the probability of not obtaining a good run is at most $\frac{\delta}{4} + \frac{\delta}{4} + \frac{\delta}{4} + \frac{\delta}{4} = \delta$. □
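
The coverage requirement on m in event (B) can be checked numerically. The sketch below assumes that the sample-size bound reads m >= (2t/eps) ln(4t/delta), which is a reading of a damaged formula in this copy, and the function names and the example values of t, eps, and delta are hypothetical.

```python
import math

def sample_size_for_coverage(t, eps, delta):
    """Smallest integer m with m >= (2t/eps) * ln(4t/delta) (assumed form of the bound)."""
    return math.ceil((2 * t / eps) * math.log(4 * t / delta))

def miss_probability_bound(t, eps, m):
    """Union bound on the chance that some (eps/2t)-significant monomial is unrepresented.

    Each of at most t monomials has weight above eps/2t, so m independent points
    all miss it with probability below (1 - eps/2t)**m.
    """
    return t * (1 - eps / (2 * t)) ** m

t, eps, delta = 3, 0.1, 0.05
m = sample_size_for_coverage(t, eps, delta)
bound = miss_probability_bound(t, eps, m)
```

For these values the computed bound stays below delta/4, which is the share of the failure probability allotted to event (B).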

To complete the proof of Theorem 4.2.5 (and hence the proof of Theorem 4.2.1), note that by
Lemma 4.2.9 the probability is at least $1 - \delta$ that a good run occurs, implying (by Lemma 4.2.7)
that the set of monomials output by the Monomial ss-Learning Algorithm is $\epsilon$-close to the set
of monomials to be learned. □

4.3 A Sufficient Condition for ss-Learning

In the proof of Lemma 4.2.3, the pairs of n-bit strings that were generated by LABELED-PAIRS
were concatenated into a single 2n-bit string. It was then shown that the representation
class corresponding to pairs labeled as "same" was learnable in terms of tCNF. Notice that all
that was in fact required was that the concept of "sameness" be polynomially predictable (as
defined in Chapter 2). To apply this technique in general we will need the following definition.

Definition 4.3.1 Let $F = (F, \Gamma, c, \Sigma)$ be a family of Boolean formulas. Then define

• $FF_{2n} = \{f(x_1, x_2, \ldots, x_n) \wedge f(x_{n+1}, x_{n+2}, \ldots, x_{2n}) : f \in F_n\}$, i.e. the collection of formulas
over 2n variables obtained by conjoining two copies of any formula f from $F_n$, such
that each variable $x_j$ appearing in the second description is changed to $x_{n+j}$.

• $\vee_t FF_{2n} = \{f_1 \vee f_2 \vee \cdots \vee f_t : f_i \in FF_{2n}, 1 \le i \le t\}$.

• $\vee_t FF = \bigcup_{n \in \mathbb{N}} \vee_t FF_{2n}$.

• $FF_t = (\vee_t FF, \Gamma, c_{FF_t}, \Sigma)$; i.e. a representation class in which the function $c_{FF_t}$ maps
formulas in $\vee_t FF$ to their sets of satisfying assignments.

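Note that for any collection of formulas $\mathcal{F} = \{f_1, f_2, \ldots, f_t\} \subseteq F_n$, the size of the formula

$(f_1(x_1, \ldots, x_n) \wedge f_1(x_{n+1}, \ldots, x_{2n})) \vee (f_2(x_1, \ldots, x_n) \wedge f_2(x_{n+1}, \ldots, x_{2n})) \vee \cdots \vee (f_t(x_1, \ldots, x_n) \wedge f_t(x_{n+1}, \ldots, x_{2n})) \in \vee_t FF_{2n}$

is at most

$((2t)\,\mathrm{MAXSIZE}(\mathcal{F}) + 1 + (t - 1) + 2t) \lceil \log 2n \rceil \le (6t)\,\mathrm{MAXSIZE}(\mathcal{F}) \lceil \log 2n \rceil.$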

Theorem 4.3.2 If F is PAC-learnable and $FF_t$ is polynomially predictable then F is
ss-learnable.

Proof: The proof is a straightforward generalization of the proof of Theorem 4.2.1. We
outline it here. Assume the existence of A, a PAC-learning algorithm for F, that runs in time
bounded by $p_A(n, s_1, \frac{1}{\epsilon}, \frac{1}{\delta})$, where $p_A$ is a polynomial, $f \in F$ is the target formula, and $s_1$ is
the input parameter bounding the size of f. We also assume the existence of B, an algorithm
that predicts $FF_t$ in time at most $p_B(2n, s_2, \frac{1}{\epsilon}, \frac{1}{\delta})$, where $p_B$ is a polynomial, $ff_t$ is the target
formula, and $s_2$ is the input parameter bounding the size of $ff_t$.

To ss-learn t disjoint concepts $f_1, f_2, \ldots, f_t$ of F, examples of SameConcept$(\langle f_1, \ldots, f_t \rangle)$
are constructed by forming the concatenations of the pairs of strings output by LABELED-PAIRS.
The number of such examples needed is at most the running time of B. The ss-learning
algorithm is given as a parameter s, which bounds $\mathrm{MAXSIZE}(\mathcal{F})$. Since SameConcept $\in \vee_t FF_{2n}$,

$|\mathrm{SameConcept}| \le (6t)\,\mathrm{MAXSIZE}(\mathcal{F}) \lceil \log 2n \rceil \le 6ts \lceil \log 2n \rceil.$

Thus the running time of B is no more than $p_B(2n, 6ts\lceil \log 2n \rceil, \frac{2m^2}{\delta}, \frac{4}{\delta})$, so at most this
many examples are required in order for B to learn the intermediate oracle. B is then
run; it is given these examples whenever it calls the oracle EXAMPLE($D^2$, SameConcept),
along with the parameters n, $\epsilon_B = \frac{\delta}{2m^2}$, $\delta_B = \frac{\delta}{4}$, and the size parameter $6ts\lceil \log 2n \rceil$, where
$m = \max\{\frac{2t}{\epsilon} \ln \frac{4t}{\delta},\; p_A(n, s, \frac{2t}{\epsilon}, \frac{4t}{\delta})\}$. Then B, in time at most $p_B(2n, 6ts\lceil \log 2n \rceil, \frac{2m^2}{\delta}, \frac{4}{\delta})$, outputs
some polynomial-time algorithm M that is used as an oracle for the concept SameConcept. A
set of m points is then randomly generated, using LABELED-PAIRS, and M is applied to
every pair of the m points to obtain equivalence classes. The algorithm A is run once for
each of the equivalence classes, using the m points as examples, the error parameters $\epsilon_A = \frac{\epsilon}{2t}$
and $\delta_A = \frac{\delta}{4t}$, and the size parameter s. The time needed is again at most polynomial in
all of the relevant parameters. An analysis identical to the one for the monomial case yields
that the concepts learned by A are $\epsilon$-close to the true concepts with probability at least $1 - \delta$. □

4.4 ss-Learning Other Boolean Formulas 

The sufficient conditions of Theorem 4.3.2 are applied to show as corollaries that for each k,
the families of kDNF, kCNF, and k-decision-lists are ss-learnable.

Corollary 4.4.1 For any constant k, the family of kCNF formulas is ss-learnable.

Proof: If F is the family of kCNF expressions, then for each n, $FF_{2n}$ is a family of kCNF
expressions, since the conjunction of two kCNF expressions is also a kCNF expression. Then
$\vee_t FF$, the disjunction of t kCNF expressions, may be represented by a tkCNF expression without
more than a polynomial increase in size, since t and k are constants. To see this, let the
disjunction be

$\bigvee_{i=1}^{t} E_i$

where each $E_i$ is a kCNF expression. This is equivalent to the tkCNF expression

$\bigwedge (c_{j_1} \vee c_{j_2} \vee \cdots \vee c_{j_t})$

where the conjunction is taken over all possible choices of clauses $c_{j_1} \in E_1, c_{j_2} \in E_2, \ldots, c_{j_t} \in E_t$.
Thus we can learn $FF_t$ in terms of tkCNF expressions (such expressions are PAC-learnable by
Theorem 2.0.2). Consequently, $FF_t$ is polynomially predictable. Since the family F itself is
PAC-learnable, the result follows from Theorem 4.3.2. □
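
The distribution step in this proof, turning a disjunction of t kCNF expressions into a single tkCNF expression, can be sketched as follows. The encoding of clauses as sets of (variable, polarity) pairs and the two example expressions are assumptions made for this illustration.

```python
from itertools import product

def eval_cnf(cnf, x):
    """A CNF is a list of clauses; a clause is a frozenset of literals (variable, polarity)."""
    return all(any(x[v] == pol for v, pol in clause) for clause in cnf)

def disjunction_to_cnf(cnfs):
    """Distribute OR over AND: one merged clause per choice of a clause from each E_i."""
    return [frozenset().union(*choice) for choice in product(*cnfs)]

# Two hypothetical 2CNF expressions over x0, x1, x2 (polarity 1 = positive literal).
E1 = [frozenset({(0, 1)}), frozenset({(1, 1), (2, 0)})]   # x0 AND (x1 OR not x2)
E2 = [frozenset({(2, 1)}), frozenset({(0, 0), (1, 1)})]   # x2 AND (not x0 OR x1)

combined = disjunction_to_cnf([E1, E2])

# Brute-force check that the merged CNF is equivalent to E1 OR E2.
equivalent = all(
    eval_cnf(combined, x) == (eval_cnf(E1, x) or eval_cnf(E2, x))
    for x in product((0, 1), repeat=3)
)
```

Each merged clause has at most tk literals (here 2 * 2 = 4), and the number of clauses is the product of the clause counts, which stays polynomial because t is a constant.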

Corollary 4.4.2 For any constant k, the family of kDNF formulas is ss-learnable.

Proof: Let F be the family of kDNF expressions. For each n, $FF_{2n}$ is a set of 2kDNF
expressions, since the conjunction of two kDNF expressions can be described by a 2kDNF
expression. The collection of disjunctions of t such expressions, $\vee_t FF$, is also a set of 2kDNF
expressions. Thus $FF_t$ can be PAC-learned in terms of 2kDNF expressions using any of the
algorithms of [12, 35, 54, 74]. Since 2kDNF expressions are thus polynomially predictable, and
F is PAC-learnable, the result follows from Theorem 4.3.2. □

Corollary 4.4.3 For any constant k, the family of k-decision-lists is ss-learnable.

Proof: Let F be the family of k-decision-lists. Any formula in the set $FF_{2n}$ can be described
by a 2k-decision-list as follows. The monomials of the new list are formed by conjoining one
monomial from each of the two original lists; they are given a label of 1 if both of the
constituent monomials are labeled with 1's, otherwise they are given a label of 0. The new labeled
monomials are then sorted so that

1. Every monomial with first half $m_i$ occurs before every monomial with first half $m_j$ if and
only if $m_i$ occurs before $m_j$ in the first decision-list.

2. For all monomials with the same first half, every monomial with the second half $m_i$
occurs before every monomial with second half $m_j$ if and only if $m_i$ occurs before $m_j$ in
the second decision-list.

The disjunction of two k-decision-lists can be represented by a 2k-decision-list which is
constructed in a manner similar to the conjunctive case. Thus the disjunction of t 2k-decision-lists
can be represented by a 2tk-decision-list, which is PAC-learnable (Theorem 2.0.2). Thus
$FF_t$ is PAC-learnable in terms of 2tk-decision-lists, and is therefore polynomially predictable.
Since F is PAC-learnable, the result follows from Theorem 4.3.2. □
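
The conjunctive construction in this proof can be sketched and checked exhaustively on a small case. The list-of-pairs encoding, the helper names, and the example decision list are assumptions for this illustration; the sketch also assumes that the two lists mention disjoint variables, as they do in $FF_{2n}$ where the second copy uses $x_{n+1}, \ldots, x_{2n}$.

```python
from itertools import product

def eval_dl(dl, x):
    """A decision list is a list of (monomial, label); the first satisfied monomial decides."""
    for monomial, label in dl:
        if all(x[i] == b for i, b in monomial.items()):
            return label
    raise ValueError("decision list needs a final default item (empty monomial)")

def shift(dl, n):
    """Rename each variable x_j to x_{n+j}, giving the second copy of the list."""
    return [({i + n: b for i, b in mono.items()}, label) for mono, label in dl]

def conjoin(dl1, dl2):
    """Product list, ordered by position in dl1 first and position in dl2 second.

    Assumes dl1 and dl2 use disjoint variables, so merging the dicts is safe.
    """
    return [({**m1, **m2}, l1 & l2) for m1, l1 in dl1 for m2, l2 in dl2]

n = 3
# Hypothetical 2-decision-list: (x0 and not x1 -> 1), (x2 -> 0), (default -> 1).
dl = [({0: 1, 1: 0}, 1), ({2: 1}, 0), ({}, 1)]
ff = conjoin(dl, shift(dl, n))

# Check that ff computes f(x) AND f(y) on every concatenation xy.
cube = list(product((0, 1), repeat=n))
correct = all(eval_dl(ff, x + y) == (eval_dl(dl, x) & eval_dl(dl, y)) for x in cube for y in cube)
```

Every monomial in the product list has at most 2k literals, matching the 2k-decision-list claimed in the proof.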

4.5 Unparameterized ss-Learning and the VC-Dimension

As seen from Theorem 4.3.2, in order for our ss-learning algorithm to be applied successfully
to a PAC-learnable representation class F, it is sufficient that the class $FF_t$ be polynomially
predictable. In this section we give sufficient conditions for the class $FF_t$ to be polynomially
predictable when the representation class F is over an unparameterized domain and has finite
Vapnik-Chervonenkis dimension.

Thus far we have exclusively discussed representation classes $F = (F, \Gamma, c, \Sigma)$ in which the
set F consists of Boolean formulas, and hence is implicitly defined as an infinite collection of
subclasses parameterized by n, the number of variables in the formula. Similarly, we have
implicitly parameterized F by |r|, the size of the formula to be learned. For any fixed n and
|r|, the PAC-learnability of any subclass of formulas of size |r| over n variables is uninteresting,
because there are at most a finite number of possible formulas, and a naive exhaustive search
technique can be shown to successfully PAC-learn. However, nontrivial learning problems do
arise over a single unparameterized domain when the domain is infinite. For example, if the
domain is the Euclidean plane, we may be interested in the learnability of concepts represented
by rectangles with sides parallel to the coordinate axes. For such learning problems, we adopt
the convention that examples are described by single characters from some alphabet X, rather
than strings of characters from X. (This alphabet will be denoted by X, rather than $\Sigma$, in order
to indicate that this convention is in effect.) Thus X is the domain of the learning problem.
We define these problems formally. 

Definition 4.5.1 Let $R' = (R', \Gamma', c', X)$ be a representation class. A representation class
$R = (R, \Gamma, c, X)$ (over the domain X) is polynomially learnable in terms of R' if there
exists a (possibly randomized) algorithm A and polynomial p such that, for all $r \in R$, for every
probability distribution D on X, and for all $\epsilon, \delta > 0$, if A is given as input the parameters $\epsilon$ and
$\delta$, and may access the oracle EXAMPLE(D, r), then in time $p(\frac{1}{\epsilon}, \frac{1}{\delta})$ A outputs some $r' \in R'$
such that with probability at least $1 - \delta$, $D(r \oplus r') \le \epsilon$. If R is polynomially learnable in terms of
itself, then R is polynomially learnable. If there exists any class R' (for which the membership
problem is decidable in polynomial time) such that R is learnable in terms of R', then R is
polynomially predictable.

Thus polynomial learnability is similar to PAC-learnability, except that the domain and size
parameters have been eliminated and the running time of the learning algorithm is not allowed
to depend on n or |r|. Similarly, the definitions of learning one class in terms of another and
polynomial predictability for unparameterized concept classes only differ from the definitions
in the parameterized case in a like manner. Note that the definition of polynomial learnability
makes no allowance for randomized algorithms.³

Similarly, we define ss-learnability for a single (unparameterized) class of representations
over a single (unparameterized) domain. The definitions of $\epsilon$-closeness and the oracle
LABELED-PAIRS are generalized to the case of representation classes in the obvious way.

Definition 4.5.2 The representation class $R = (R, \Gamma, c, X)$ is ss-learnable (learnable in a
semi-supervised manner) if for each positive integer t, there exists an algorithm A and polynomial p
such that for every disjoint collection $\{r_1, r_2, \ldots, r_t\} \subseteq R$, for any probability distribution D on
X such that

$D\Big(\bigcup_{i=1}^{t} r_i\Big) = 1,$

and for all $\epsilon, \delta > 0$, if A is given as input the parameters $\epsilon$ and $\delta$, and may access the oracle
LABELED-PAIRS$_{D,r_1,\ldots,r_t}$, then in time $p(\frac{1}{\epsilon}, \frac{1}{\delta})$ A outputs a finite collection of representations in R
that, with probability at least $1 - \delta$, is $\epsilon$-close to $\{r_1, r_2, \ldots, r_t\}$. Such an algorithm A is an
ss-learning algorithm for the class R.

From Theorem 4.3.2, we saw that (in the parameterized case) the polynomial predictability
of $FF_t$ plays an important role in the application of our general technique. Similarly, for
any (unparameterized) representation class R the polynomial predictability of the associated
representation class $RR_t$ (defined below) will be relevant.

Definition 4.5.3 Let C = (C, X) be a concept class and let R be a representation class for C.
Then

• for any $c_1, c_2 \in C$, the concept $c_1 \times c_2$ (over domain $X \times X$) is the concept $\{(x, y) : x \in c_1, y \in c_2\}$.

• $C \times C$ is the set of concepts $\{c_1 \times c_2 : c_1, c_2 \in C\}$. Let C × C be the concept class
$(C \times C, X \times X)$.

• CC is the set of concepts $\{c \times c : c \in C\}$. Let CC be the concept class $(CC, X \times X)$, and
let RR be a representation class for CC.

• $\vee_t C$ is the set of concepts $\{c_1 \vee c_2 \vee \cdots \vee c_t : c_i \in C\}$. Let $C_t$ be the concept class
$(\vee_t C, X \times X)$, and let $R_t$ be a representation class for $C_t$.

• $\vee_t CC$ is the set of concepts $\{c_1 \vee c_2 \vee \cdots \vee c_t : c_i \in CC\}$. Let $CC_t$ be the concept class
$(\vee_t CC, X \times X)$, and let $RR_t$ be a representation class for $CC_t$.

³ The results of this section can be extended to representation classes learnable by randomized algorithms for
classes that contain the concepts $\emptyset$, X, and {r} for each $r \in R$.

Theorem 4.5.4 Let C be a concept class such that R is polynomially learnable and $RR_t$ is
polynomially predictable. Then R is ss-learnable.

Proof: Similar to the proof of Theorem 4.3.2, omitting the size and domain parameters. □

We will refine the sufficient condition of Theorem 4.5.4 by incorporating sufficient conditions
for the polynomial predictability of $RR_t$. In order to do this, we will rely on a characterization
of the polynomially learnable representation classes due to Blumer et al. [12]. In order to state
the relevant necessary and sufficient conditions for polynomial learnability we first review some
definitions.

Recall from Chapter 3 that the Vapnik-Chervonenkis dimension (VC-dimension) of a concept
class (C, X) is the size of the largest finite subset of the domain X that is shattered by C.
The VC-dimension is infinite if arbitrarily large subsets of X are shattered. Recall also that
the VC-dimension of a representation class R is the VC-dimension of its induced concept class
C(R).

Definition 4.5.5 If $R = (R, \Gamma, c, X)$ is a representation class, then a randomized polynomial-time
hypothesis finder for R is a randomized polynomial-time algorithm that takes as input a
finite sample of a concept in R and, for some $\gamma > 0$, with probability at least $\gamma$ outputs some
representation $r \in R$ that is consistent with the sample. (Recall that a representation r is
consistent with a sample if every positive example in the sample is an element of c(r) and no
negative example is an element of c(r).)

The following theorem is a special case of Theorem 3.1.1 in [12]. 

Theorem 4.5.6 If R is a representation class, then R is polynomially learnable if and only if
the VC-dimension of R is finite and there is a randomized polynomial-time hypothesis finder
for R.

The following is a slight variant of Theorem 3.2.4 from [12].

Lemma 4.5.7 If R is a representation class that is polynomially learnable, then $R_t$ is
polynomially predictable. Further, the time required is polynomial in t as well as $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$.

Proof: Modify the proof of Theorem 3.2.4 of [12] in a straightforward manner to allow for
a randomized polynomial-time hypothesis finder, instead of a deterministic one. □

We now prove a sufficient condition for ss-learnability of an (unparameterized) representation 
class. 

Theorem 4.5.8 If C is a concept class such that its representation class R is polynomially
learnable and there exists a randomized polynomial-time hypothesis finder for RR, then R is
ss-learnable.

Proof: By Theorem 4.5.4 it is sufficient to show that $RR_t$ is polynomially predictable. Since
R is polynomially learnable, by Theorem 4.5.6 R has finite VC-dimension. By Lemma 4.5.9
below, RR also has finite VC-dimension. This, together with the hypothesis that RR has
a randomized polynomial-time hypothesis finder (and an application of Theorem 4.5.6 once
again), implies that RR is polynomially learnable. Finally, applying Lemma 4.5.7 with RR in
place of R, we conclude that $RR_t$ is polynomially predictable. □

Lemma 4.5.9 If C (and thus R) has (finite) VC-dimension d, then CC (and thus RR) has
(finite) VC-dimension at most $4d \log 6$.

Proof: For any concept class C = (C, X), let VCdim(C) denote the VC-dimension of
C. Note that $CC \subseteq C \times C$, so clearly VCdim(CC) $\le$ VCdim(C × C). We show that
VCdim(C × C) $\le 4d \log 6$.

Define $C \times X$ to be the set $\{c \times X : c \in C\}$, and let C × X be the concept class
$(C \times X, X \times X)$. Similarly, define $X \times C = \{X \times c : c \in C\}$, and let X × C be the concept class
$(X \times C, X \times X)$. We claim that VCdim(C × X) = VCdim(X × C) = VCdim(C). We only
show that VCdim(C × X) = VCdim(C). The proof for X × C is virtually identical.

To see that VCdim(C × X) $\ge$ VCdim(C), note that if $S \subseteq X$ is a set of points that is
shattered by C, then $S \times \{z\}$ is shattered by C × X, for any particular point $z \in X$. To show
that VCdim(C × X) $\le$ VCdim(C), let $\{(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)\}$ be shattered by C × X.
Observe that for every $c \times X \in C \times X$, and for all $x, y, y' \in X$, we have $(x, y) \in c \times X$ if and
only if $(x, y') \in c \times X$. Thus we can replace $y_1, \ldots, y_k$ with any single point $y \in X$, and then
$\{(x_1, y), (x_2, y), \ldots, (x_k, y)\}$ is shattered by C × X. Let $S = \{x_1, x_2, \ldots, x_k\}$. Since $S \times \{y\}$ is
shattered by C × X, for every $T \subseteq S$, there is a $c \in C$ such that $(c \times X) \cap (S \times \{y\}) = T \times \{y\}$.
This is true if and only if $c \cap S = T$. Thus for every $T \subseteq S$, there is some $c \in C$ such
that $c \cap S = T$ and thus S is shattered by C. Since $|S| = |S \times \{y\}|$, this demonstrates that
VCdim(C × X) $\le$ VCdim(C), and thus completes the proof of the claim that VCdim(C × X) =
VCdim(C).

Finally, for any concept classes $C_1 = (C_1, X)$ and $C_2 = (C_2, X)$, define the internal
intersection (denoted $\cap$) of $C_1$ and $C_2$ by $C_1 \cap C_2 = \{c_1 \cap c_2 : c_1 \in C_1, c_2 \in C_2\}$. Let $C_1 \cap C_2$
denote the concept class $(C_1 \cap C_2, X \times X)$. Lemma 3.2.3 of [12] shows that if C has
VC-dimension d, then $C \cap C$ has VC-dimension at most $4d \log 6$. A virtually identical proof shows
that $C_1 \cap C_2$ has VC-dimension at most $4d \log 6$, for any two concept classes $C_1$, $C_2$ each
with VC-dimension d. This result, together with our claim above, shows that the concept class
$(C \times X) \cap (X \times C)$ has VC-dimension at most $4d \log 6$. To complete the proof of the lemma,
note that $C \times C = (C \times X) \cap (X \times C)$; thus VCdim(C × C) $\le 4d \log 6$. □
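
The claims of this lemma can be checked by brute force on a small finite domain. The sketch below is only an illustration: the domain of four points, the choice of intervals as the concept class, and the function names are assumptions for the example, and the logarithm is taken to be base 2, which the text does not state.

```python
from itertools import combinations, product
from math import log2

def shatters(concepts, points):
    """True iff every subset of `points` is cut out by some concept."""
    patterns = {tuple(p in c for p in points) for c in concepts}
    return len(patterns) == 2 ** len(points)

def vc_dim(concepts, domain):
    """Largest k such that some k-subset of a finite domain is shattered."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(concepts, pts) for pts in combinations(domain, k)):
            d = k
        else:
            break  # subsets of shattered sets are shattered, so no larger set can work
    return d

X = list(range(4))
# Hypothetical concept class: all intervals {i, ..., j-1} of X, including the empty set.
C = {frozenset(range(i, j)) for i in range(5) for j in range(i, 5)}

XX = list(product(X, X))
C_times_X = {frozenset(product(c, X)) for c in C}
C_times_C = {frozenset(product(c1, c2)) for c1 in C for c2 in C}
CC = {frozenset(product(c, c)) for c in C}

d = vc_dim(C, X)
d_cx = vc_dim(C_times_X, XX)
d_cc = vc_dim(CC, XX)
d_cxc = vc_dim(C_times_C, XX)
```

On this instance the product with X leaves the dimension unchanged, and the dimensions of CC and C × C fall well under the bound of the lemma.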

As an example, the representation class of axis-aligned rectangles in the Euclidean plane
satisfies the hypothesis of Theorem 4.5.8, and thus is ss-learnable.

As a final sufficient condition, we show that if R is a representation class over an
unparameterized domain that is polynomially learnable from positive examples alone, then R is
ss-learnable. For simplicity of exposition, the definition below of learnability from positive
examples is slightly nonstandard, although easily shown equivalent to more standard definitions
(for example, the unparameterized version of the definitions of [57] or [61]). It is essentially the
same as the definition of polynomial learnability, but restricts access of the learning algorithm
to positive examples only, and further requires that the concept it finds have perfect accuracy
on the set of negative examples.

Definition 4.5.10 The representation class $R = (R, \Gamma, c, X)$ is polynomially learnable from
positive examples alone if there exists an algorithm A and polynomial p such that for all $r \in R$,
for every probability distribution D on elements of c(r) (positive examples), and for all $\epsilon, \delta > 0$,
if A is given as input the parameters $\epsilon$ and $\delta$ and may access the oracle EXAMPLE(D, r), then
in time $p(\frac{1}{\epsilon}, \frac{1}{\delta})$ A outputs some representation $r' \in R$ such that with probability at least $1 - \delta$,

$D(r - r') \le \epsilon$ and $r' - r = \emptyset$.

Note that there are some representation classes (such as monomials) that are PAC-learnable 
from positive examples alone but are not polynomially learnable from positive examples alone. 
Also, note that Theorem 4.5.11 does not assert that representation classes that are PAC- 
learnable from positive examples alone are ss-learnable. 

Theorem 4.5.11 Let $R = (R, \Gamma, c, X)$ be a representation class for the concept class C =
(C, X). If R is polynomially learnable from positive examples alone, then R is ss-learnable.

Proof: By Theorem 4.5.8 (and the fact that learnability from positive examples alone
trivially implies polynomial learnability) it suffices to show that if R is polynomially learnable
from positive examples alone, then RR has a randomized polynomial-time hypothesis finder.
We describe a randomized polynomial-time algorithm that, given as input a collection S ⊆
X × X of examples of some concept c(r) × c(r) ∈ CC, will output the representation of a
concept c(r') × c(r') ∈ CC that is consistent with S.

Let S+ consist of the positive examples of c × c in S, and let m = |S+|. Note that if
(x, y) ∈ S+ then x ∈ c and y ∈ c (whereas if (x, y) is a negative example of c × c, we cannot
deduce whether x ∉ c, or y ∉ c, or both). Form the set P = {x : ∃y such that (x, y) ∈ S+
or (y, x) ∈ S+}. Let A be a learning algorithm for R that uses positive examples only. Now
run A with accuracy parameter ε < 1/(2m) and confidence parameter δ < 1/2. If a positive
example is requested, randomly choose an element of P according to the (uniform) distribution
D assigning each element of P probability 1/|P|. By the definition of polynomial learning from
positive examples alone, A will find, with probability at least 1/2, a representation r' ∈ R such
that D(c(r) − c(r')) < ε < 1/(2m) and c(r') − c(r) = ∅. The first condition in fact asserts that c(r')
contains each element of P; otherwise, since |P| ≤ 2m, the error according to D would be at
least 1/(2m), a contradiction. The second condition asserts that c(r') contains no element outside
c(r). It follows that c(r') × c(r') is consistent with S.

□
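
The first steps of the hypothesis finder in the proof above can be sketched in code. This is only an illustration under stated assumptions, not the construction as implemented anywhere: the names `project_positive_pairs`, `make_positive_oracle`, and `is_consistent` are hypothetical, examples are assumed to arrive as `((x, y), is_positive)` tuples, concepts are modeled as Python sets, and the positive-only learner A itself is left abstract.

```python
import random


def project_positive_pairs(samples):
    """Collect the set P of points that occur in positively labeled pairs.

    A positive pair (x, y) of c x c guarantees that both x and y are in c.
    """
    points = set()
    for (x, y), is_positive in samples:
        if is_positive:
            points.add(x)
            points.add(y)
    return points


def make_positive_oracle(points, rng=None):
    """Return a zero-argument oracle that draws uniformly from P."""
    rng = rng or random.Random()
    pool = sorted(points)  # fixed order so that a seeded rng is reproducible
    if not pool:
        raise ValueError("no positive pairs: P is empty")
    return lambda: rng.choice(pool)


def is_consistent(concept, samples):
    """Check that concept x concept labels every sample pair correctly."""
    return all(((x in concept) and (y in concept)) == is_positive
               for (x, y), is_positive in samples)
```

A positive-only learner would be run against the oracle returned by `make_positive_oracle`, and its output could then be checked with `is_consistent`. Sorting the pool assumes the points are mutually comparable; that is a convenience of this sketch only.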



Finally, note that by Lemma 4.5.7, Theorems 4.5.8 and 4.5.11 show ss-learnability in a
stronger sense; the time needed to ss-learn t disjoint concepts is polynomial in t as well as 1/ε
and 1/δ.

4.6 Equivalence of Two Types of Learning

An interesting aspect of the definition of ss-learnability is that it is not at all clear how an 
algorithm might test a candidate solution for correctness. In concept learning, it is possible 
to test the accuracy of the learned concept using examples of the unknown concept which are 
provided by the teacher. In ss-learnability, all that is available is a randomly generated pair, 
possibly totally unrelated to any examples seen before. 

From this perspective, a reasonable alternate definition of ss-learning would only require
that the algorithm find a set of formulas from the set F_n that correctly predicts (within ε) the
labels from randomly generated LABELED-PAIRS, instead of requiring ε-closeness to some
unknown formulas. The alternate definition is given below; "sc" stands for "same concept".

Definition 4.6.1 A family F = (F, Γ, c, Σ) of Boolean formulas is sc-learnable if for each t ∈ ℕ
there exists an algorithm A and polynomial p such that for all n, s ≥ 1, for every disjoint
collection ℱ = {f_1, f_2, ..., f_t} ⊆ F_n with MAXSIZE(ℱ) ≤ s, for any probability distribution D
on {0,1}^n satisfying equation (4.1), and for all ε, δ > 0, if A is given as input the parameters
ε, δ, and s and may access the oracle LABELED-PAIRS_{D,f_1,...,f_t}, then in time p(n, s, 1/ε, 1/δ) A
outputs a collection 𝒢 = {g_1, g_2, ..., g_u} of (not necessarily disjoint) formulas in F_n that with
probability at least 1 − δ has the following property: The probability that a pair of examples
drawn from LABELED-PAIRS_{D,f_1,...,f_t} is incorrectly classified by 𝒢 as to whether or not they
are from the same concept is at most ε. (A pair is correctly classified by 𝒢 if either the pair
is (x, y, same) and both x and y are in exactly one g ∈ 𝒢, or the pair is (x, y, different) and for
some g, g' ∈ 𝒢, g ≠ g', x ∈ g, y ∈ g', and neither x nor y is in any other element of 𝒢.)

Note that in the above definition if a pair generated by LABELED-PAIRS contains some
string x that is not contained in any formula of 𝒢, then this is counted as an incorrect
classification. Similarly, if a pair contains a string in the intersection of two formulas of 𝒢, then
this is also an incorrect classification.
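
The classification rule in the definition of sc-learnability can be made concrete with a short sketch. It is an illustration only: formulas are modeled as Python predicates over bit strings, and the function names `classify_pair` and `is_correctly_classified` are hypothetical rather than taken from the text.

```python
def classify_pair(collection, x, y):
    """Predict "same" or "different", or return None if no prediction is valid.

    `collection` is a list of predicates, one per learned formula g.
    Each point must satisfy exactly one formula; a point covered by no
    formula, or by two or more, makes the pair count as misclassified
    whatever its true label is.
    """
    def unique_index(point):
        hits = [i for i, g in enumerate(collection) if g(point)]
        return hits[0] if len(hits) == 1 else None

    ix, iy = unique_index(x), unique_index(y)
    if ix is None or iy is None:
        return None
    return "same" if ix == iy else "different"


def is_correctly_classified(collection, x, y, label):
    """True when the collection's prediction matches the oracle's label."""
    return classify_pair(collection, x, y) == label
```

Returning `None` for uncovered or doubly covered points mirrors the two remarks above: both situations are charged as errors regardless of the label.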

Theorem 4.6.2 A family F = (F, Γ, c, Σ) of Boolean formulas is sc-learnable if and only if it
is ss-learnable.

Proof: Suppose that F is ss-learnable. Then for any n, and any disjoint f_1, f_2, ..., f_t ∈ F_n,
we can obtain with probability at least 1 − δ, and in time polynomial in n, 1/ε, and 1/δ, a set of
(not necessarily disjoint) formulas 𝒢 = {g_1, g_2, ..., g_u} such that 𝒢 is ε/2-close to {f_1, f_2, ..., f_t}.
We show that using the learned formulas 𝒢 to predict sameness/differentness will satisfy the
requirements of sc-learnability.

Suppose that 𝒢 is ε/2-close to {f_1, f_2, ..., f_t}, that LABELED-PAIRS outputs (x, y, label),
with label ∈ {same, different}, and that x ∈ f_x and y ∈ f_y, where f_x, f_y ∈ {f_1, f_2, ..., f_t}.
Since 𝒢 is ε/2-close to {f_1, f_2, ..., f_t} there is by definition a subset 𝒢' ⊆ 𝒢 and an injection
I : 𝒢' → {f_1, f_2, ..., f_t} such that, with probability at least 1 − ε/2, x is in exactly one formula
g_x ∈ 𝒢, g_x is in 𝒢', and x ∈ I(g_x) (and thus I(g_x) = f_x). The analogous relationships are true
for y.

Case 1: f_x = f_y. With probability at least 1 − ε, x is in exactly one concept g_x ∈ 𝒢, y is in
exactly one concept g_y ∈ 𝒢, g_x and g_y are in 𝒢', and I(g_x) = f_x = f_y = I(g_y), so the learned
formulas 𝒢 produce the correct response of "same concept".

Case 2: f_x ≠ f_y. With probability at least 1 − ε, x is in exactly one concept g_x ∈ 𝒢, y is in
exactly one concept g_y ∈ 𝒢, g_x and g_y are in 𝒢', and I(g_x) = f_x ≠ f_y = I(g_y), so 𝒢 produces
the correct response of "different concepts".

Thus with probability at least 1 − δ, 𝒢 = {g_1, g_2, ..., g_u} is ε/2-close to {f_1, f_2, ..., f_t} and the
probability of correct classification is at least 1 − ε. Hence the fact that F is ss-learnable implies
that F is sc-learnable.

Now suppose that F is sc-learnable. If F is PAC-learnable as well, then we are done by using
the sc-learned formulas as an oracle SC and applying Theorem 4.3.2. However, it is not clear
whether F is sc-learnable implies F is PAC-learnable. (The obvious approach to showing this
by letting t = 1 fails because there are then no negative examples, so nothing constrains the
ss-learning algorithm from overgeneralizing. If we let t = 2, with the second concept being the
negative examples of some target concept to be PAC-learned, then the ss-learning algorithm is
only required to work provided that the negative examples can also be expressed as a formula
in F.) We show that regardless of the PAC-learnability of F, if F is sc-learnable then F is
ss-learnable.

Let F be sc-learnable. Then for any n and any collection ℱ = {f_1, f_2, ..., f_t} of disjoint
formulas in F_n we can obtain in polynomial time, with probability at least 1 − δ, a collection
𝒢 = {g_1, g_2, ..., g_u} of (not necessarily disjoint) formulas in F_n such that the elements of 𝒢
correctly classify pairs from LABELED-PAIRS as to sameness/differentness with probability
at least 1 − ε²/(65t²). We will show that 𝒢 is therefore ε-close to ℱ, and the theorem follows.

Claim: For each ε/(4t)-significant f ∈ ℱ there is a unique g ∈ 𝒢 such that D(f ∩ g) > ε/(8t).

Proof of Claim: Assume there is no such g. Then the probability of choosing two points
from f not both in some particular g ∈ 𝒢 (and thus obtaining an incorrect classification) is at
least

(ε/(4t)) · (ε/(4t) − ε/(8t)) = ε²/(32t²) > ε²/(65t²),

a contradiction. Now assume that there is more than one such element of 𝒢, say g and g'.
Then the probability of choosing two points x, y such that x ∈ f ∩ g and y ∈ f ∩ g' (and thus
obtaining an incorrect classification regardless of whether g and g' are disjoint) is at least
(ε/(8t))² > ε²/(65t²), a contradiction, thus proving the claim. □

To complete the proof of Theorem 4.6.2, we find a subset 𝒢' ⊆ 𝒢 and a bijection I : 𝒢' →
{f ∈ ℱ : f is ε/(4t)-significant} (and hence an injection with range ℱ) witnessing that 𝒢 is ε-close
to ℱ. For each ε/(4t)-significant f, let the unique g such that D(f ∩ g) > ε/(8t) be an element of 𝒢',
with I(g) = f.

Now for each g ∈ 𝒢', D(g ⊕ I(g)) < ε/(4t); for if not, then either D(g − I(g)) ≥ ε/(8t), or
D(I(g) − g) ≥ ε/(8t). The cases are similar; we show that the first case cannot happen: If

D(g − I(g)) = D(g ∩ ¬I(g)) ≥ ε/(8t)

then since D(g ∩ I(g)) > ε/(8t) (by definition of g), we have the probability that a misclassification
occurs is at least (ε/(8t))² > ε²/(65t²), a contradiction.
It follows that

D(E2) = D(∪_{g ∈ 𝒢'} g ⊕ I(g)) < t · ε/(4t) = ε/4.

Then, as in the proof of Lemma 4.2.7,

E1 − E2 ⊆ ∪{f_i : f_i is ε/(4t)-insignificant}, (4.6)

which is true if

E1 ∩ ∪{f_i : f_i is ε/(4t)-significant} ⊆ E2. (4.7)

To see that containment (4.7) holds, note that if x ∈ f_i and f_i is ε/(4t)-significant, then by
the claim, there is a unique g_i ∈ 𝒢' such that D(f_i ∩ g_i) > ε/(8t), and thus I(g_i) = f_i. If x
is also in E1, then x ∉ g_i, so x ∈ g_i ⊕ I(g_i) ⊆ E2, and containment (4.7) follows.
Containment (4.6) implies that

D(E1 − E2) < ε/4.

By the definition of sc-learnability,

D(E3) = D(∪_{i≠j} (g_i ∩ g_j)) ≤ ε²/(65t²) < ε/4,

so

D(E1 ∪ E2 ∪ E3) ≤ D(E2) + D(E1 − E2) + D(E3) < ε/4 + ε/4 + ε/4 < ε.

□

Corollary 4.6.3 Each of the families of Boolean formulas described in Theorem 4.2.1 and
Corollaries 4.4.1, 4.4.2, and 4.4.3 is sc-learnable. Furthermore, any family satisfying the
hypothesis of Theorem 4.3.2 is sc-learnable.

We can also extend the definition of sc-learnability to representation classes over an
unparameterized domain.

Definition 4.6.4 A representation class R = (R, Γ, c, X) is sc-learnable if for each t ∈ ℕ there
exists an algorithm A and polynomial p such that for every disjoint collection {r_1, r_2, ..., r_t} ⊆ R,
for any probability distribution D over X such that

D(∪_{i=1}^{t} r_i) = 1,

and for all ε, δ > 0, if A is given as input the parameters ε and δ and may access the oracle
LABELED-PAIRS_{D,r_1,...,r_t}, then in time p(1/ε, 1/δ) A outputs a collection H = {h_1, h_2, ..., h_u} of
(not necessarily disjoint) representations in R that with probability at least 1 − δ has the following
property: The probability that a pair of examples drawn from LABELED-PAIRS_{D,r_1,...,r_t} is
incorrectly classified by H as to whether or not they are from the same concept is at most ε.

A result analogous to Theorem 4.6.2 holds for unparameterized representation classes.

Theorem 4.6.5 A representation class R (over an unparameterized domain) is sc-learnable if
and only if it is ss-learnable.

Proof: Similar to the proof of Theorem 4.6.2. □

Corollary 4.6.6 Any representation class satisfying the hypothesis of Theorems 4.5.4, 4.5.8,
or 4.5.11 is sc-learnable.

5 PREDICTION USING WEAK AUTOMATA



Nearly all of the work done thus far in computational learning and prediction has allowed the
learner the power of a (possibly time-bounded) Turing machine. An interesting question is
to what extent learning or prediction can be accomplished by less powerful automata, i.e. by
automata with less memory available than an infinite tape. In this chapter we consider this
question in a different setting from the PAC model that was used in earlier chapters.

The predictive power of Turing machines has been studied in some detail, for example in
the literature on NV-extrapolation [6, 9, 19]. In this model, the predicting machine is shown an
infinite sequence of strings; after each string, the machine outputs a guess as to whether the 
string is in the unknown target language. The goal is for the predicting machine to eventually 
make only correct guesses. The classes of languages that can be predicted under this model 
have been shown to be identical to the classes that can be inferred by a Popperian inductive 
inference machine [19]. Littlestone [54] considers bounds on the number of erroneous predictions 
for certain types of languages. In [38] the problem of predicting {0, 1} functions over a domain 
is considered when there is a probability distribution over the domain elements. 

Gold [31] introduces a model of prediction in which a "thinker" and an "environment"
exchange messages. The environment reads as input the thinker's previous response and generates
new information, part of which is a reward/punishment signal, to be read by the thinker. The
thinker then uses this information (as well as information received in earlier exchanges) to
generate its next response. The goal of the thinker is to generate outputs that, after a sufficiently
large number of message exchanges, result in the maximum possible rewards. Gold proves that
there is a primitive recursive thinker that will eventually maximize its rewards for any finite
state environment, but that no finite state thinker with this property exists.

We investigate the predictive power of weaker varieties of automata, specifically
deterministic finite state automata (DFAs), 1-counter machines (1CMs), and deterministic pushdown
automata (DPDAs). The model of prediction used here is essentially the same as in the model
of NV-extrapolation, except for the type of automata doing the predicting. Alternatively, our
model can be thought of as a restricted version of Gold's paradigm. Note that the study of
prediction here differs from the problem of inferring an automaton from training examples, or
that of predicting the outputs of an automaton (see, for example, [62, 64]). In those problems,
the output may be, for example, a DFA, but the automaton doing the learning (or prediction)
is a Turing machine. In the model defined here the prediction is actually performed by a DFA,
DPDA, etc., rather than by a Turing machine.

In addition to being interesting in its own right, the study of the predictive ability of weak 
automata may shed some light on the predictive power of arbitrary programs. In particular, 
strong negative results for limited types of machines may suggest techniques by which to prove 
analogous results for more powerful machines. Interesting problems related to this model of 
prediction include determining which classes of languages are predictable and finding upper 
bounds on the size of the classes that can be predicted by automata of a certain size (or,
equivalently, finding lower bounds on the size of the smallest predicting automaton for classes
of a particular size). We consider both of these questions.

5.1 A Model for Prediction by Finite State Automata 

Recall that if σ is a string of characters, then |σ| is the length of σ; i.e., the number of characters
in σ. Let A be an alphabet. Recall that A* is the set of all finite strings of characters in A of
length at least zero. Let A+ represent the set of all such strings of length at least one. We use
ε to denote the empty string.

The model for prediction by DFAs is as follows. Let M be a Moore machine [40] and L be
some language over Σ (i.e., a subset of Σ*) for some finite alphabet Σ. Initially M is given as
input a finite string σ_1 ∈ Σ*. M makes a guess as to whether or not σ_1 is in L; this guess is
the output associated with the state that M is in after having read σ_1. If the output is "+",
then M has guessed that σ_1 ∈ L. If the output is "−", then M has guessed that σ_1 is not in L.
M is then given as input either the character "r" or "w", depending on whether its guess was
right or wrong, respectively. This process is then repeated for the strings σ_2, σ_3, .... If after
some point all of M's guesses are correct, then M is correct on (L, σ), where σ = σ_1, σ_2, σ_3, ....
If M is correct on (L, σ) for every σ and for every L in some class of languages C, then M is a
predicting DFA for C. We make this more precise below.

Let M = (Q, Σ ∪ {r, w}, Δ, δ, λ, q_1) be a Moore machine with finite input alphabet Σ ∪ {r, w}
(where r and w are special symbols not in Σ) and output alphabet Δ = {+, −}. Q is the (finite)
state set and q_1 is the start state. The function λ : Q → Δ describes the state outputs and
δ : Q × (Σ ∪ {r, w}) → Q is the transition function. As is done in [40], we extend the definition
of the transition function δ to handle input strings, rather than just single input characters, as
follows. Define δ̂ : Q × (Σ ∪ {r, w})* → Q by

1. For each q ∈ Q, δ̂(q, ε) = q, and

2. For each q ∈ Q, x ∈ (Σ ∪ {r, w})*, and a ∈ Σ ∪ {r, w}, δ̂(q, xa) = δ(δ̂(q, x), a).

Thus δ̂(q, x) is the state that M is in after starting in state q and reading the input string x.
For convenience we use δ in place of δ̂ in what follows; the two functions agree on all arguments
for which both are defined, so there will be no ambiguity. Let C be a collection (or class) of
languages over the alphabet Σ.

Definition 5.1.1 Let σ = σ_1, σ_2, σ_3, ... be an infinite sequence, where each σ_i is a finite string
in Σ*. Let C be a class of languages over Σ and let L be a language in C. Define the presentation
of σ with respect to L and M (denoted by PRES_M(L, σ)) to be the string σ_1 b_1 σ_2 b_2 σ_3 b_3 ..., where
each character b_i is defined as follows. Let p_i ∈ Q be the state that M is in after reading the
input string σ_1 b_1 σ_2 b_2 ... σ_i (i.e. p_i = δ(q_1, σ_1 b_1 σ_2 b_2 ... σ_i)). If σ_i ∈ L and λ(p_i) = "+", or if
σ_i ∉ L and λ(p_i) = "−", then b_i = "r". If σ_i ∈ L and λ(p_i) = "−", or if σ_i ∉ L and λ(p_i) =
"+", then b_i = "w". If there exists some i_0 such that, for all i > i_0, b_i = "r", then M is
correct on (L, σ). If M is correct on (L, σ) for every σ, then M is correct for L. If b_i = "r"
for all i, then M is exactly correct on (L, σ). If M is exactly correct on (L, σ) for every σ,
then M is exactly correct for L. The definitions of exact correctness apply whether σ is finite
or infinite.

Thus M is correct for L if, when presented any infinite sequence of finite strings over Σ,
there is some point past which all of its guesses (as to whether a string is in L) are correct. M
is exactly correct for L if all of its guesses are correct.
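
The presentation in Definition 5.1.1 can be traced mechanically on a finite prefix of σ. The sketch below is illustrative only: the function name `presentation` and the dictionary encoding of δ and λ are assumptions of this example, and the target language is supplied as a membership predicate standing in for the external source of "r"/"w" feedback.

```python
def presentation(delta, lam, start, language, strings):
    """Return the feedback characters b_1, ..., b_k for a finite prefix.

    delta:    dict mapping (state, symbol) -> state, defined on the input
              alphabet together with the feedback symbols "r" and "w"
    lam:      dict mapping state -> "+" or "-"
    language: membership predicate for the target language L
    strings:  the finite prefix sigma_1, ..., sigma_k
    """
    state, feedback = start, []
    for s in strings:
        for ch in s:                      # read sigma_i one character at a time
            state = delta[(state, ch)]
        guessed_in = lam[state] == "+"    # the guess is the output of state p_i
        b = "r" if guessed_in == language(s) else "w"
        feedback.append(b)
        state = delta[(state, b)]         # the machine then reads b_i as input
    return feedback
```

A machine is exactly correct on the prefix when every returned character is "r"; correctness in the limit cannot be decided from any finite prefix, so the sketch only reports the feedback seen so far.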

Definition 5.1.2 If, for all L ∈ C, M is correct for L, then M is a predicting DFA for C. If
there is a predicting DFA for C, then C is DFA-predictable.

The size of a predicting DFA M (denoted by |M|) is the number of states in M. We also
use this terminology and notation when M is a DFA (i.e. not a Moore machine).

Predicting membership in an unknown language according to the above model appears
to be a more suitable task for DFAs than is concept learning. The problem of learning is
closely related to that of finding hypotheses consistent with a finite set of examples [11, 13, 70].
Because of the limited memory available to a DFA, it would require a very large automaton
to remember all examples seen (or, alternatively, to keep track of all hypotheses consistent
with the examples already seen) and thus to be able to find consistent hypotheses. In the
prediction model, the automaton is permitted to misclassify some examples; it doesn't need to
remember each example, but instead can wait until it's "ready" for a particular example before
correctly classifying it.

Note that the predicting DFA is not guaranteed to see a complete presentation of all strings 
in the domain. But since the automaton is not required to output a hypothesis describing the 
target language, but instead must only be correct on the strings it sees (past some point), this 
does not present any difficulties. 

5.2 A General Upper Bound 

In this section we demonstrate an upper bound on the size of any class that is predictable by 
a DFA of size n. 

Theorem 5.2.1 Let C be a DFA-predictable class of languages. The smallest predicting DFA
for C has size at least 2|C| − 2. Thus any class with a predicting DFA of size n contains at
most n/2 + 1 languages.
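
The two forms of the bound are related by simple arithmetic: n ≥ 2|C| − 2 rearranges to |C| ≤ n/2 + 1. The helper names below are hypothetical, and rounding n/2 down for odd n is an assumption of this sketch (it follows because |C| is an integer).

```python
def min_predictor_size(num_languages):
    """Lower bound 2|C| - 2 on the number of states of a predicting DFA."""
    return 2 * num_languages - 2


def max_class_size(num_states):
    """Largest |C| compatible with n states: floor(n / 2) + 1."""
    return num_states // 2 + 1
```

For example, a class of 5 languages needs at least 8 states, and an 8-state predictor can handle at most 5 languages.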

Let M = (Q, Σ ∪ {r, w}, Δ, δ, λ, q_1) be a predicting DFA for C, with Q = {q_1, ..., q_n}. For
any state q_i ∈ Q, let M^{q_i} = (Q, Σ ∪ {r, w}, Δ, δ, λ, q_i) (the Moore machine M with start state
q_i).

Before proving Theorem 5.2.1, we first make the following definitions and prove several
lemmas.

Definition 5.2.2 Let L be any language in C and let q_{h_L} ∈ Q. If there exists some sequence
of strings σ such that there is a finite prefix σ_1 b_1 σ_2 b_2 ... σ_k b_k of PRES_M(L, σ) with

δ(q_1, σ_1 b_1 σ_2 b_2 ... σ_k b_k) = q_{h_L},

then q_{h_L} is reachable modulo prediction. If q_{h_L} is reachable modulo prediction and the
automaton M^{q_{h_L}} is exactly correct for L, then q_{h_L} is a home state for L.

Thus a state q_{h_L} that is reachable modulo prediction is a home state for L if the predicting
DFA M, when started in state q_{h_L}, is exactly correct for L. For each L ∈ C, let HS_L be the set
of home states for L. Lemma 5.2.3 states that each language in C has at least one home state.

Lemma 5.2.3 For each L ∈ C, HS_L ≠ ∅.

Proof: We assume otherwise and prove that a contradiction results. Suppose that for some
L, HS_L = ∅. Thus for each q ∈ Q, either q is not reachable modulo prediction or M^q is not
exactly correct for L (or both). Clearly there is at least one state that is reachable modulo
prediction. For each such state q_i, M^{q_i} is not exactly correct for L. Thus there exists a finite
sequence of strings whose presentation causes M^{q_i} to make an incorrect guess; that is, there
must exist a positive integer k_i and a sequence of strings σ^i = σ^i_1, σ^i_2, ..., σ^i_{k_i} such that
b^i_{k_i} = "w" in the prefix σ^i_1 b^i_1 σ^i_2 b^i_2 ... σ^i_{k_i} b^i_{k_i} of the presentation PRES_{M^{q_i}}(L, σ^i) (with the b^i_k's
defined analogously to the b_k's in Definition 5.1.1). The sequence of strings σ^mistake defined
by the following procedure forces M to make an infinite number of incorrect guesses. Recall
that q_1 is the start state of M.

1. Set σ^mistake = σ^1. Note that M makes an incorrect guess on the string σ^1_{k_1}.

2. Let q_j be the state that M is in after having been given as input PRES_M(L, σ^mistake)
(for σ^mistake as defined so far). Since q_j is reachable modulo prediction, by assumption
M^{q_j} is not exactly correct for L.

3. Append the sequence σ^j to the end of σ^mistake. M makes an incorrect guess on the string
σ^j_{k_j}. Return to Step 2.

Since M makes an infinite number of incorrect guesses in the sequence σ^mistake, M is not
correct on (L, σ^mistake). Thus M is not correct for L, so M is not a predicting DFA for C.
Since this contradicts the definition of M, HS_L must be nonempty for each L ∈ C, proving the
lemma. □

Lemma 5.2.4 A state q ∈ Q is a home state for at most one language in C.

Proof: Let q ∈ Q, and let L_1 and L_2 be (distinct) languages in C. Since L_1 and L_2 are
distinct, there exists a string x ∈ L_1 ⊕ L_2. M^q cannot be exactly correct on both (L_1, x) and
(L_2, x). Thus M^q is not exactly correct for both L_1 and L_2, so q is not a home state for both
L_1 and L_2. □

Lemma 5.2.5 states that if M is in a home state for L after having read as input a prefix of
PRES_M(L, σ) ending with an "r" or "w", then from that point on M will always be in a home
state for L after each prefix of PRES_M(L, σ) ending with an "r" or "w" has been read.

Lemma 5.2.5 Let σ = σ_1, σ_2, ... be any infinite sequence of finite strings. Consider the string
PRES_M(L, σ), as defined in Definition 5.1.1. If, for some k,

δ(q_1, σ_1 b_1 σ_2 b_2 ... σ_k b_k) ∈ HS_L,

then for all j > k,

δ(q_1, σ_1 b_1 σ_2 b_2 ... σ_j b_j) ∈ HS_L.

Proof: Suppose that k < j, and that q_(k) and q_(j) are states such that

δ(q_1, σ_1 b_1 σ_2 b_2 ... σ_k b_k) = q_(k)

and

δ(q_1, σ_1 b_1 σ_2 b_2 ... σ_j b_j) = q_(j),

and q_(j) is not a home state for L. Then, by the definition of home states, there exists a sequence
of strings σ^(j) such that M^{q_(j)} is not exactly correct on (L, σ^(j)). Thus M^{q_(k)} is not exactly
correct on (L, σ^(k+1,j) σ^(j)), where σ^(k+1,j) = σ_{k+1}, σ_{k+2}, ..., σ_j. This is because M^{q_(k)}, after
reading as input σ^(k+1,j), is in state q_(j), and M^{q_(j)} is not exactly correct on (L, σ^(j)). Hence
M^{q_(k)} is not exactly correct for L, and thus q_(k) is not a home state for L. The result follows
by contraposition. □

For each L ∈ C define

PHS_L^+ = {q ∈ Q : δ(q, r) ∈ HS_L and λ(q) = "+"}

and

PHS_L^− = {q ∈ Q : δ(q, r) ∈ HS_L and λ(q) = "−"}.

(The notation is intended to suggest "pre-home states".)

Lemma 5.2.6 Let L be any language in C not equal to ∅ or Σ*. Then both PHS_L^+ and PHS_L^−
are nonempty. Furthermore, if ∅ ∈ C then PHS_∅^− is nonempty, and if Σ* ∈ C, then PHS_{Σ*}^+ is
nonempty.

Proof: Let x and y be finite strings over Σ, with x ∈ L and y ∉ L. By Lemma 5.2.3, HS_L
is nonempty, so we can let q_HS be a home state for L. Let q_PHS be the state such that

δ(q_HS, x) = q_PHS.

Since q_HS is a home state for L and since x ∈ L, by Lemma 5.2.5 and the definition of home
states δ(q_PHS, r) ∈ HS_L and λ(q_PHS) = "+". Thus q_PHS ∈ PHS_L^+. Since such a state q_PHS
must exist, PHS_L^+ ≠ ∅. By an analogous argument, using the string y ∉ L, PHS_L^− ≠ ∅.

Similar arguments can be used to show that PHS_∅^− and PHS_{Σ*}^+ (provided that the languages
∅ and Σ*, respectively, are in the class C) are nonempty. □

Proof of Theorem 5.2.1: Note that for any (not necessarily distinct) languages L_1 and
L_2 in C, PHS_{L_1}^+ ∩ PHS_{L_2}^− = ∅, since the states in the two sets have different images under
the function λ. Furthermore, if L_1 ≠ L_2 then PHS_{L_1}^+ ∩ PHS_{L_2}^+ = ∅, due to the fact that the
transition function δ on input "r" maps the states in the two sets into states in HS_{L_1} and HS_{L_2},
respectively. By Lemma 5.2.4, HS_{L_1} and HS_{L_2} are disjoint, so PHS_{L_1}^+ ∩ PHS_{L_2}^+ = ∅. By a
similar argument, PHS_{L_1}^− ∩ PHS_{L_2}^− = ∅ for any distinct L_1 and L_2 in C.

Thus there are at least |C| − 2 languages L in C such that L has associated with it two
nonempty sets of states PHS_L^+ and PHS_L^−, and all such sets are pairwise disjoint. At most two
languages in C have associated with them a single nonempty set of states (PHS_∅^− or PHS_{Σ*}^+)
disjoint from the other sets and from each other. Consequently the number of distinct states
in Q is at least 2(|C| − 2) + 2 = 2|C| − 2, so the size of M is at least 2|C| − 2. This concludes
the proof of Theorem 5.2.1. □

For any n > 0, define C_n to be the class of languages (over the binary alphabet) of satisfying
assignments of monomials over n variables, augmented by the two languages ∅ and {0, 1}^n. It
can be shown that the bound of Theorem 5.2.1 is tight for C_n provided that only n-bit input
strings are permitted.

5.3 Languages Predictable by DFAs 

In this section we characterize the classes of languages that can be predicted by DFAs. 

Theorem 5.3.1 The DFA-predictable classes of languages are exactly the finite classes of
regular languages.

Proof: We first prove that any finite class of regular languages is DFA-predictable. 

Lemma 5.3.2 Let C = {L_1, L_2, ..., L_c} be a finite class of regular languages over Σ, and let
M_1, M_2, ..., M_c be finite state machines that accept the languages L_1, L_2, ..., L_c, respectively.
Then there exists a predicting DFA M for the class C such that

|M| = |M_1| + |M_2| + ... + |M_c|.

Note that we assume that for each L_i there is a DFA M_i that accepts all strings in L_i and
rejects all strings in Σ* − L_i. A similar result can be shown if we allow each language L_i to be
defined over its own alphabet Σ_i, and assume that the DFA M_i accepts all strings in L_i and
rejects all strings in Σ_i* − L_i. In this case it is easily seen that there exists a DFA M_i' of size
|M_i| + 1 that accepts L_i and rejects Σ* − L_i (where Σ = ∪_{i=1}^{c} Σ_i). Thus a bound on the size of
M of |M_1'| + |M_2'| + ... + |M_c'| = |M_1| + |M_2| + ... + |M_c| + c is easily obtained. For clarity of
presentation we prove the result as stated in the lemma.

Proof: The lemma is proved by constructing a predicting DFA M for C, using the accepting
DFAs. M first simulates M_1 and makes all of its guesses based on whether the input strings
are in L_1. If M makes an incorrect guess, it then starts simulating M_2, and makes its guesses
based on the language L_2. This continues until M finds the right language, after which point
all of its guesses will be correct.

For each i such that 1 ≤ i ≤ c, let M_i = (Q_i, Σ, δ_i, q^i_start, F_i), where the set of states is
Q_i = {q^i_1, q^i_2, ..., q^i_{|Q_i|}} and the set of accepting states is F_i.

We define the predicting DFA M = (Q, Σ ∪ {r, w}, Δ, δ, λ, q^1_start). As for all predicting DFAs,
r, w ∉ Σ and Δ = {+, −}. Let Q = ∪_{i=1}^{c} Q_i. The function λ is defined as
follows. Let q be any state in Q, and suppose that q ∈ Q_i. Then

λ(q) = +  if q ∈ F_i,
λ(q) = −  if q ∉ F_i.

All of the transitions in δ_1, δ_2, ..., δ_c are also in δ. In addition, δ contains the following
transitions. For each i = 1, 2, ..., c and each j = 1, 2, ..., |Q_i|, δ contains the transition defined by
δ(q^i_j, r) = q^i_start. For each i = 1, 2, ..., c − 1 and each j = 1, 2, ..., |Q_i|, δ contains the
transition defined by δ(q^i_j, w) = q^{i+1}_start. Finally, for each j = 1, 2, ..., |Q_c|, δ contains the transitions
δ(q^c_j, w) = q^c_start. (Actually, it is irrelevant how this last set of transitions is defined.)

Clearly M has size |M_1| + |M_2| + ... + |M_c|. Let L_i be any language in C. We prove that
M is correct for L_i; it then follows immediately that M is a predicting DFA for C. Let σ =
σ_1, σ_2, ... be any infinite sequence of finite strings over Σ. Consider the string PRES_M(L_i, σ) =
σ_1 b_1 σ_2 b_2 ....

Claim 5.3.3 The number of w's in the sequence b_1, b_2, ... is less than i.

Proof of Claim: Suppose that b_m is the (i − 1)st occurrence of w in b_1, b_2, .... We show
that no more w's will appear. Note that, by the definition of the transition function δ, all states
entered by M between the s-th and (s + 1)-st appearances of w in PRES_M(L_i, σ) are states from
Q_{s+1}, and thus that immediately after reading the s-th occurrence of w, M is in state q^{s+1}_start. Thus
after reading σ_1 b_1 σ_2 b_2 ... σ_m w, M is in q^i_start. Suppose that the string σ_{m+1} is in L_i. Then,
since q^i_start is the start state of M_i, which accepts L_i, and since all of the transitions in δ_i are in
δ, δ(q^i_start, σ_{m+1}) ∈ F_i. Thus by the definition of λ, λ(δ(q^i_start, σ_{m+1})) = "+". Since σ_{m+1} ∈ L_i,
the value of b_{m+1} will be "r". Similarly, suppose that σ_{m+1} ∉ L_i. Then δ(q^i_start, σ_{m+1}) ∉ F_i.
Thus by the definition of λ, λ(δ(q^i_start, σ_{m+1})) = "−". Since σ_{m+1} ∉ L_i, the value of b_{m+1} will
again be "r". Note that in either case, by the definition of δ, M will be back in state q^i_start after
having read b_{m+1}. An easy induction shows that each of b_{m+1}, b_{m+2}, ... is an "r", proving the
claim. □

Thus only a finite number of the b_k's are w's, so there exists some k_0 such that for all k > k_0,
b_k = "r". Thus M is correct on (L_i, σ). Since σ was chosen arbitrarily, M is correct for L_i. □

Note that Lemma 5.3.2 also yields an upper bound on the size of a predicting DFA. In addition,
the proof gives a technique for constructing a predicting DFA for any DFA-predictable class C
that makes at most |C| − 1 incorrect predictions.
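
The construction in the proof of Lemma 5.3.2 is concrete enough to sketch in code. The following is a minimal illustration under stated assumptions, not a definitive implementation: each accepting DFA is encoded as a dictionary with the hypothetical keys "states", "delta", "start", and "accept", states of the i-th DFA are tagged with i to keep the state sets disjoint, and the function names are invented for this example.

```python
def build_predictor(dfas):
    """Combine accepting DFAs M_1, ..., M_c into one predicting Moore machine.

    Returns (delta, lam, start), where delta maps (state, symbol) -> state
    and lam maps state -> "+" or "-". States are pairs (i, q).
    """
    delta, lam = {}, {}
    last = len(dfas) - 1
    for i, m in enumerate(dfas):
        # Keep every transition of the i-th accepting DFA.
        for (q, a), q2 in m["delta"].items():
            delta[((i, q), a)] = (i, q2)
        for q in m["states"]:
            lam[(i, q)] = "+" if q in m["accept"] else "-"
            # Right guess: restart the same DFA on the next string.
            delta[((i, q), "r")] = (i, m["start"])
            # Wrong guess: move on to the next DFA (the last one stays put).
            nxt = min(i + 1, last)
            delta[((i, q), "w")] = (nxt, dfas[nxt]["start"])
    return delta, lam, (0, dfas[0]["start"])


def count_mistakes(predictor, language, strings):
    """Run the predictor on a finite sequence and count its wrong guesses."""
    delta, lam, state = predictor
    mistakes = 0
    for s in strings:
        for ch in s:
            state = delta[(state, ch)]
        correct = (lam[state] == "+") == language(s)
        if not correct:
            mistakes += 1
        state = delta[(state, "r" if correct else "w")]
    return mistakes
```

On any sequence, `count_mistakes` should never exceed the number of DFAs minus one when the target language is accepted by one of them, which matches the bound stated above. The choice of where the last DFA goes on "w" is arbitrary here, as it is in the proof.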

It remains to be shown that any DFA-predictable class is a finite class of regular languages. 
Lemma 5.3.4 states that all languages in a DFA-predictable class must be regular. 

Lemma 5.3.4 Let C be a DFA-predictable class, and let L be a language in C. Then L is 
regular. 

Proof: Let M = (Q, Σ ∪ {r, w}, Δ, δ, λ, q_1) be a predicting DFA for C. We define a DFA M'
that accepts L. Let q' be some home state for L; at least one such state exists by Lemma 5.2.3.
Then define M' = (Q, Σ, δ', q', F), where the set of accepting states F is defined by

F = {q ∈ Q : λ(q) = "+"}

and δ' contains all transitions in δ that don't involve the symbols r or w. To see that M'
accepts L, let x be any string in Σ*. If x ∈ L, then since q' is a home state for L, the automaton
M^{q'} is exactly correct for L, and thus exactly correct on (L, x). Thus λ(δ(q', x)) = "+", so
δ'(q', x) ∈ F. Similarly, if x ∉ L then λ(δ(q', x)) = "−", so δ'(q', x) ∉ F. Thus M' accepts L,
so L is regular. □

Any predicting DFA must have a finite number of states; thus by the lower bound of
Theorem 5.2.1, the number of languages in a DFA-predictable class must be finite. Hence the only
DFA-predictable classes are the finite classes of regular languages. □

5.4 A Model for Prediction by Deterministic Pushdown Automata 

We now define a model of prediction by deterministic pushdown automata (DPDAs). The
model is the same as in the case of DFAs except for the type of machine doing the predicting.
Prediction is now performed by an automaton that is a variant of the standard DPDA in which
there is an output associated with the states of the machine.

Let $M = (Q, \Sigma \cup \{r, w\}, \Gamma, \Delta, \delta, \lambda, q_1, Z_1)$ be an automaton with input alphabet $\Sigma \cup \{r, w\}$
(where $r, w \notin \Sigma$) and stack alphabet $\Gamma$. $Q$ is the set of states, $q_1$ is the start state, and $Z_1 \in \Gamma$
is the start symbol (i.e., the symbol that is initially on the stack). $\Sigma$, $\Gamma$, and $Q$ are all finite.
The transition function $\delta$ maps elements of $Q \times (\Sigma \cup \{r, w\} \cup \{\varepsilon\}) \times \Gamma$ to finite subsets of $Q \times \Gamma^*$.
Following [40], in order to ensure that $M$ is deterministic, we place the following two constraints
on $\delta$.

1. Let $q \in Q$, $a \in \Sigma \cup \{r, w\}$, and $Z \in \Gamma$. If $\delta(q, a, Z) \neq \emptyset$, then $\delta(q, \varepsilon, Z) = \emptyset$.

2. For each $q \in Q$, $a \in \Sigma \cup \{r, w\} \cup \{\varepsilon\}$, and $Z \in \Gamma$, $\delta(q, a, Z)$ contains at most one element.

Thus at any stage in a computation by $M$ there is at most one transition that can be
applied.
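
The two constraints can be checked mechanically on a finite transition table. The following is a minimal sketch, not part of the original development: the dictionary encoding of $\delta$, the constant EPS, and the function name is_deterministic are assumptions made here for illustration.

```python
from typing import Dict, Set, Tuple

EPS = ""  # assumed encoding of the empty input (an epsilon-move)

# Assumed encoding of delta: (state, input symbol or EPS, top stack symbol)
# maps to a set of (next state, tuple of symbols replacing the top symbol).
Delta = Dict[Tuple[str, str, str], Set[Tuple[str, Tuple[str, ...]]]]


def is_deterministic(delta: Delta) -> bool:
    """Return True if the table satisfies constraints 1 and 2."""
    for (q, a, Z), targets in delta.items():
        # Constraint 2: each entry contains at most one element.
        if len(targets) > 1:
            return False
        # Constraint 1: a move on a real input symbol and an epsilon-move
        # may not both be defined for the same state and stack symbol.
        if a != EPS and targets and delta.get((q, EPS, Z)):
            return False
    return True
```

A table that defines both $\delta(q, a, Z)$ and $\delta(q, \varepsilon, Z)$ for the same pair $(q, Z)$ is rejected, as is any entry with two targets.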

Define $O \subseteq Q \times \Gamma$ to be the set of pairs $(q, Z)$ such that $\delta(q, \varepsilon, Z) = \emptyset$. Thus if $M$ is in state $q$
with the symbol $Z$ on top of its stack and $(q, Z) \in O$, then $M$ cannot make a transition without
reading a new input character. If $(q, Z) \notin O$ then $M$ cannot read another input character until
it has made one or more $\varepsilon$-moves.

$M$ operates as a deterministic pushdown automaton (as defined in [40]) with the following
exceptions. We are interested in the outputs produced by a DPDA, rather than the language
that it accepts. Thus we dispense with the set of accepting states that is included in the
definition in [40], and instead add an output alphabet $\Delta = \{+, -\}$ and a function $\lambda : Q \to \Delta$.
The function $\lambda$ associates an output with each state in $Q$. We are interested in the outputs of
the function $\lambda$ when $M$ has read an input string and has exhausted all possible $\varepsilon$-moves.

An instantaneous description (ID) of $M$ is a triple $(q, x, \gamma)$, where $q \in Q$, $x \in (\Sigma \cup \{r, w\})^*$,
and $\gamma \in \Gamma^*$. The ID records the state, input remaining to be read, and stack contents of
$M$ at some point in its computation. The binary relation $\mapsto_M$ is defined such that, if $ID_1$
and $ID_2$ are instantaneous descriptions and $ID_2$ describes $M$'s computation at a point one
step later than $ID_1$, then $ID_1 \mapsto_M ID_2$. More formally, let $q_i, q_j \in Q$, $a \in \Sigma \cup \{r, w\} \cup \{\varepsilon\}$,
$x \in (\Sigma \cup \{r, w\})^*$, $Z_i \in \Gamma$, and $\gamma_i, \gamma_j \in \Gamma^*$. If $(q_i, Z_i) \in O$, then $(q_i, ax, \gamma_i Z_i) \mapsto_M (q_j, x, \gamma_i \gamma_j)$
if and only if $\delta(q_i, a, Z_i) = \{(q_j, \gamma_j)\}$. If $(q_i, Z_i) \notin O$, then $(q_i, ax, \gamma_i Z_i) \mapsto_M (q_j, ax, \gamma_i \gamma_j)$
if and only if $\delta(q_i, \varepsilon, Z_i) = \{(q_j, \gamma_j)\}$. Note that in this notation the symbol on the top of
the stack is the rightmost symbol in the string of stack symbols. This is the opposite of the
standard convention; in the opinion of the author, however, stack operations are represented
more naturally under this system. Let $\mapsto_M^*$ represent the reflexive and transitive closure of
$\mapsto_M$. Thus $ID_1 \mapsto_M^* ID_2$ if $ID_2$ describes $M$'s computation at some point zero or more steps
later than $ID_1$. We omit the subscript $M$ when it is clear from context.
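
The one-step relation on IDs can be sketched as a small function. This is an illustrative sketch only: the table encoding, the constant EPS, and the name step are assumptions introduced here, and the stack is a tuple whose last element is the top, matching the rightmost-is-top convention just described.

```python
EPS = ""  # assumed encoding of the empty input (an epsilon-move)


def step(delta, q, x, stack):
    """Apply the step relation once to the ID (q, x, stack).

    delta maps (state, symbol or EPS, top symbol) to a one-element set
    {(next state, tuple of symbols replacing the top symbol)}.
    Returns the next ID, or None if no transition applies.
    """
    if not stack:
        return None  # no transitions are defined on an empty stack
    top, rest = stack[-1], stack[:-1]
    eps_targets = delta.get((q, EPS, top), set())
    if eps_targets:
        # (q, top) is not in O: an epsilon-move must be made first.
        (p, pushed), = eps_targets
        return (p, x, rest + pushed)
    if x:
        # (q, top) is in O: consume one input character.
        targets = delta.get((q, x[0], top), set())
        if targets:
            (p, pushed), = targets
            return (p, x[1:], rest + pushed)
    return None
```

Iterating step until it returns None, or until the input is empty and the current pair lies in $O$, gives the reflexive and transitive closure used in the rest of the chapter.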

The following definitions are exactly analogous to those in Definition 5.1.1. 

Definition 5.4.1 Let $M$ be as defined above. Let $\sigma$, $C$, and $L$ be as in Definition 5.1.1.
Define a presentation of $\sigma$ with respect to $L$ and $M$ (denoted by $PRES_M(L, \sigma)$) to be the string
$\sigma_1 b_1 \sigma_2 b_2 \sigma_3 b_3 \ldots$, where each character $b_i$ is defined as follows. Let $p_i \in Q$ and $Z_i \in \Gamma$ be the
current state and the symbol on top of the stack of $M$, respectively, after $M$ has read the input
string $\sigma_1 b_1 \sigma_2 b_2 \cdots \sigma_i$, exhausted all possible $\varepsilon$-moves, and is prepared to read the next input
character (so $(p_i, Z_i) \in O$). If $\sigma_i \in L$ and $\lambda(p_i) = \text{"+"}$, or if $\sigma_i \notin L$ and $\lambda(p_i) = \text{"-"}$, then
$b_i = \text{"r"}$. If $\sigma_i \in L$ and $\lambda(p_i) = \text{"-"}$, or if $\sigma_i \notin L$ and $\lambda(p_i) = \text{"+"}$, then $b_i = \text{"w"}$. If there
exists some $i_0$ such that, for all $i > i_0$, $b_i = \text{"r"}$, then $M$ is correct on $\langle L, \sigma \rangle$. If $M$ is correct
on $\langle L, \sigma \rangle$ for every $\sigma$, then $M$ is correct for $L$. If $b_i = \text{"r"}$ for all $i$, then $M$ is exactly correct
on $\langle L, \sigma \rangle$. If $M$ is exactly correct on $\langle L, \sigma \rangle$ for every $\sigma$, then $M$ is exactly correct for $L$. The
definitions of exact correctness apply whether $\sigma$ is finite or infinite.

As was the case for DFAs, $M$ is correct for $L$ if eventually it makes only correct guesses as
to whether strings are in $L$, and is exactly correct for $L$ if all of its guesses are correct.
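
The construction of a presentation can be sketched for a finite sequence of strings. The code below is a hedged illustration rather than the model itself: the predictor is abstracted by two assumed callables, read (consume a string and exhaust all $\varepsilon$-moves) and output (the value of $\lambda$ at the current state), and the toy predictor at the end is invented purely to exercise the function.

```python
def presentation(read, output, start, in_L, sigma):
    """Return [s1, b1, s2, b2, ...] for a finite sequence sigma.

    read(config, text) -> config   (assumed: reads text, exhausts epsilon-moves)
    output(config)     -> "+" or "-"
    in_L(s)            -> True if the string s is in the language L
    """
    config, pres = start, []
    for s in sigma:
        config = read(config, s)
        guessed_member = output(config) == "+"
        b = "r" if guessed_member == in_L(s) else "w"
        pres += [s, b]
        config = read(config, b)  # the machine then reads the feedback symbol
    return pres


def exactly_correct_on(pres):
    """True if every feedback character b_i in the presentation is 'r'."""
    return all(b == "r" for b in pres[1::2])


# Hypothetical toy predictor over the alphabet {a, b}: its configuration is
# its current guess, and it flips the guess whenever it reads a "w".
def toy_read(config, text):
    for ch in text:
        if ch == "w":
            config = "+" if config == "-" else "-"
    return config


pres = presentation(toy_read, lambda c: c, "-", lambda s: True, ["a", "bb", ""])
```

With $L = \Sigma^*$ the toy predictor is wrong once and right thereafter, so it is correct on this sequence but not exactly correct on it.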

Definition 5.4.2 If, for all $L \in C$, $M$ is correct for $L$, then $M$ is a predicting DPDA for $C$.
If there is a predicting DPDA for C, then C is DPDA-predictable. 

Note that if $M$ is a predicting DPDA then its stack is never empty in any ID that appears
during a computation on input $PRES_M(L, \sigma)$ for $L \in C$ and $\sigma$ as described above. Since no
transitions are defined when the stack is empty, $M$ would halt if its stack were empty and
be unable to read any more input. This violates the definition of a predicting DPDA, which
requires the automaton to function on arbitrary presentations.

5.5 A General Upper Bound for DPDAs 

In this section we prove an upper bound on the size of any DPDA-predictable class relative to
the size of the predicting automaton.

Theorem 5.5.1 Let $M = (Q, \Sigma \cup \{r, w\}, \Gamma, \Delta, \delta, \lambda, q_1, Z_1)$ be a predicting DPDA for some class
$C$. Then $|C| \le |Q||\Gamma|$.

Proof: We first define home configurations, which are analogous to home states for predicting
DFAs. Let $Q = \{q_1, q_2, \ldots, q_{|Q|}\}$ and let $\gamma$ be a string in $\Gamma^+$. Define TOP($\gamma$) and BOTTOM($\gamma$)
to be the rightmost and leftmost characters, respectively, of $\gamma$. For any $q_i \in Q$, we define the
predicting DPDA $M^{[q_i, \gamma]} = (Q \cup \{q_0\}, \Sigma \cup \{r, w\}, \Gamma, \Delta, \delta^{[q_i, \gamma]}, \lambda, q_0, Z_1)$, where $\delta^{[q_i, \gamma]}$ contains all of the
transitions in $\delta$ as well as the transition

$\delta^{[q_i, \gamma]}(q_0, \varepsilon, Z_1) = \{(q_i, \gamma)\}$.

Thus $M^{[q_i, \gamma]}$ is the predicting DPDA that sets the stack contents equal to $\gamma$, moves to state $q_i$,
and then simulates $M$.

Definition 5.5.2 Let $L$ be any language in $C$, $q_i$ be a state in $Q$, and $\gamma$ be a string in $\Gamma^+$. The
pair $[q_i, \gamma]$ is a configuration of $M$. (Note that a configuration is similar to an instantaneous
description, but without the input string information. Thus a configuration is independent of
the input to $M$.) If $(q_i, \text{TOP}(\gamma)) \in O$, then $[q_i, \gamma]$ is an I/O configuration. If $[q_i, \gamma]$ is an I/O
configuration and there exists some sequence of strings $\sigma$ with a finite prefix $\sigma_1 b_1 \sigma_2 b_2 \ldots \sigma_k b_k$
of $PRES_M(L, \sigma)$ such that

$(q_1, \sigma_1 b_1 \sigma_2 b_2 \ldots \sigma_k b_k, Z_1) \mapsto^* (q_i, \varepsilon, \gamma)$,

then the configuration $[q_i, \gamma]$ is reachable modulo prediction. If $M^{[q_i, \gamma]}$ is exactly correct for $L$,
then the configuration $[q_i, \gamma]$ is also said to be exactly correct for $L$. If $[q_i, \gamma]$ is both reachable
modulo prediction and exactly correct for $L$, then $[q_i, \gamma]$ is a home configuration for $L$.

Thus an I/O configuration $[q_i, \gamma]$ that is reachable modulo prediction is a home configuration
for $L$ if the predicting DPDA $M$, when started in state $q_i$ with stack contents $\gamma$, is exactly correct
for $L$. For each $L \in C$, let $HC_L$ be the set of home configurations for $L$. Note that for any
$q_i \in Q$, $x \in \Sigma \cup \{r, w\}$, and $\gamma_i \in \Gamma^+$, there is at most one I/O configuration $[q_j, \gamma_j]$ such that
$(q_i, x, \gamma_i) \mapsto^* (q_j, \varepsilon, \gamma_j)$, for some $q_j \in Q$ and $\gamma_j \in \Gamma^+$.

The following three lemmas are analogous to Lemmas 5.2.3, 5.2.4, and 5.2.5 in the proof
for DFAs.

Lemma 5.5.3 For each $L \in C$, $HC_L \neq \emptyset$.



Proof: We assume otherwise and prove that a contradiction results. Suppose that for some
$L$, $HC_L = \emptyset$. Thus for each configuration $[q, \gamma]$, either $[q, \gamma]$ is not reachable modulo
prediction or $[q, \gamma]$ is not exactly correct for $L$ (or both). Clearly there is at least one configuration
that is reachable modulo prediction. For each such configuration $[q_i, \gamma_j]$, $[q_i, \gamma_j]$ is not exactly
correct for $L$. Thus there exists a finite sequence of strings whose presentation causes $M^{[q_i, \gamma_j]}$
to make an incorrect guess; that is, there must exist a positive integer $s_{ij}$ and a sequence of
strings $\sigma^{ij} = \sigma^{ij}_1, \ldots, \sigma^{ij}_{s_{ij}}, \ldots$ such that $b^{ij}_{s_{ij}} = \text{"w"}$ in the prefix $\sigma^{ij}_1 b^{ij}_1 \sigma^{ij}_2 b^{ij}_2 \ldots b^{ij}_{s_{ij}}$
of the presentation $PRES_{M^{[q_i, \gamma_j]}}(L, \sigma^{ij})$ (with the $b^{ij}$'s defined analogously to the $b_i$'s in
Definition 5.4.1). The sequence of strings $\sigma^{mistakes}$ defined by the following procedure forces $M$
to make an infinite number of incorrect guesses. Recall that $q_1$ and $Z_1$ are the start state and
initial stack contents, respectively, of $M$. Let $\gamma_1$ denote the string consisting of only the start
symbol $Z_1$.

1. Set $\sigma^{mistakes} = \sigma^{11}$ (i.e., a sequence that forces an incorrect guess by the automaton
$M^{[q_1, \gamma_1]} = M$). Note that $M$ makes an incorrect guess on the string $\sigma^{11}_{s_{11}}$.

2. Let $q_u$ be the state and $\gamma_v$ be the stack contents of $M$ after $M$ has been given as input
$PRES_M(L, \sigma^{mistakes})$ (for $\sigma^{mistakes}$ as defined so far), has exhausted all possible $\varepsilon$-moves,
and is ready to read the next input character. Thus $[q_u, \gamma_v]$ is an I/O configuration, and
thus reachable modulo prediction.

3. Append the sequence $\sigma^{uv}$ to the end of $\sigma^{mistakes}$. $M$ makes an incorrect guess on the
string $\sigma^{uv}_{s_{uv}}$. Return to Step 2.

Since $M$ makes an infinite number of incorrect guesses in the sequence $\sigma^{mistakes}$, $M$ is not
correct on $\langle L, \sigma^{mistakes} \rangle$. Thus $M$ is not correct for $L$, so it is not a predicting DPDA for $C$.
This contradicts the definition of $M$, so $HC_L$ must be nonempty for each $L \in C$. □

Lemma 5.5.4 No configuration can be exactly correct for more than one language in $C$. Thus
a configuration $[q, \gamma]$ is a home configuration for at most one language in $C$.

Proof: Let $q \in Q$, $\gamma \in \Gamma^+$, and let $L_1$ and $L_2$ be (distinct) languages in $C$. Since $L_1$ and
$L_2$ are distinct, there exists a string $x \in L_1 \oplus L_2$. $M^{[q, \gamma]}$ cannot be exactly correct on both
$\langle L_1, x \rangle$ and $\langle L_2, x \rangle$. Thus $M^{[q, \gamma]}$ is not exactly correct for both $L_1$ and $L_2$, so $[q, \gamma]$ is not a
home configuration for both $L_1$ and $L_2$. □

Lemma 5.5.5 Let $\sigma = \sigma_1, \sigma_2, \ldots, \sigma_t$ be any finite sequence of finite strings, and let $[q_i, \gamma_i]$ be
an I/O configuration that is exactly correct for $L$. Consider the string $PRES_M(L, \sigma)$. If

$(q_i, \sigma_1 b_1 \sigma_2 b_2 \ldots \sigma_t b_t, \gamma_i) \mapsto^* (q_j, \varepsilon, \gamma_j)$

such that $[q_j, \gamma_j]$ is an I/O configuration, then $[q_j, \gamma_j]$ is exactly correct for $L$.

Proof: Suppose that $[q_j, \gamma_j]$ is not exactly correct for $L$. Then, by the definition of exact
correctness, there exists a sequence of strings $\alpha$ such that $M^{[q_j, \gamma_j]}$ is not exactly correct on
$\langle L, \alpha \rangle$. Thus $M^{[q_i, \gamma_i]}$ is not exactly correct on $\langle L, \sigma\alpha \rangle$. This is because $M^{[q_i, \gamma_i]}$, after reading
as input the presentation of $\sigma$ and exhausting all $\varepsilon$-moves, is in the configuration $[q_j, \gamma_j]$, and
$M^{[q_j, \gamma_j]}$ is not exactly correct on $\langle L, \alpha \rangle$. Hence $M^{[q_i, \gamma_i]}$ is not exactly correct for $L$, so $[q_i, \gamma_i]$
is not exactly correct for $L$. This contradicts our hypothesis, proving the result. □

Let $[q, \gamma]$ be a home configuration for $L$. We define MINSTK($[q, \gamma]$) to be the longest prefix
$\gamma'$ of $\gamma$ such that, for all sequences $\sigma = \sigma_1, \sigma_2, \ldots, \sigma_t$ (where each $\sigma_i \in \Sigma^*$), each configuration
of $M$ that is reached in the computation

$(q, PRES_M(L, \sigma), \gamma) \mapsto^* (\hat{q}, \varepsilon, \hat{\gamma})$,

where $[\hat{q}, \hat{\gamma}]$ is an I/O configuration, is of the form $(q', \beta, \gamma'\alpha)$, for some $q' \in Q$, some suffix
$\beta$ of $PRES_M(L, \sigma)$, and some $\alpha \in \Gamma^*$. Thus MINSTK($[q, \gamma]$) is the maximal bottom portion
of the stack that remains unchanged throughout any computation of $M$ that begins in the
configuration $[q, \gamma]$ and reads as input a presentation of any sequence $\sigma$ with respect to $L$ and
$M$.

We partition the class $C$ into two subclasses $C_1$ and $C_2$, as follows. Define $C_1$ to be the set
of languages $L \in C$ such that, for every home configuration $[q, \gamma]$ of $L$, the string MINSTK($[q, \gamma]$)
has length at least one. Define $C_2$ to be the set of languages $L \in C$ such that, for some
$[q, \gamma] \in HC_L$, MINSTK($[q, \gamma]$) is the empty string. Thus $C_1$ contains all languages $L \in C$ such
that, once $M$ is in a home configuration for $L$, there is some (nonempty) string of characters
on the bottom of the stack that will remain unchanged throughout any possible computation
of $M$ on input any presentation of strings with respect to $L$ and $M$. The subclass $C_2$ contains
those languages $L$ with at least one home configuration with the property that, if $M$ is started
in that configuration, then there is some presentation that will cause the bottom character on
the stack to be changed. Clearly $C_1$ and $C_2$ are disjoint and $C = C_1 \cup C_2$. We first prove two
claims about $C_1$ and $C_2$.

Claim 5.5.6 For each $L \in C_1$, there exists a state $q$, a character $Z \in \Gamma$, and a string $\beta \in \Gamma^*$
such that

$(q, r, Z) \mapsto^* (p, \varepsilon, Z\beta)$,   (5.1)

where $[p, Z\beta]$ is an I/O configuration that is exactly correct for $L$.

Proof of Claim: Suppose that $L \in C_1$. Let $[q, \gamma] \in HC_L$. By Lemma 5.5.3, some such
home configuration exists for $L$, and by Lemma 5.5.4 it is not a home configuration for any
other language in $C$. By the definition of $C_1$, there must exist some sequence $\sigma = \sigma_1, \sigma_2, \ldots, \sigma_t$
such that for some state $\hat{q}$,

$(q, \sigma_1 b_1 \sigma_2 b_2 \ldots \sigma_t, \gamma) \mapsto^* (\hat{q}, \varepsilon, \text{MINSTK}([q, \gamma]))$,

where $\sigma_1 b_1 \sigma_2 b_2 \ldots \sigma_t$ is a prefix of $PRES_M(L, \sigma)$. (In fact, $\sigma_1 b_1 \sigma_2 b_2 \ldots \sigma_t$ is all except the
last character of $PRES_M(L, \sigma)$.) That is, there is some sequence $\sigma$ such that after reading
$\sigma_1 b_1 \sigma_2 b_2 \ldots \sigma_t$ the stack contents of $M$ is exactly MINSTK($[q, \gamma]$). If this were not the case then
MINSTK($[q, \gamma]$) would not be of minimum length, as is required by its definition. We can assert
that a stack equal to MINSTK($[q, \gamma]$) is achieved after reading the last character of $\sigma_t$, rather
than after some $b_j$, since if the latter were the case we could define $\sigma_{j+1}$ to be the empty string.
Note also that since $[q, \gamma] \in HC_L$, $b_t = \text{"r"}$.
Let $p \in Q$ and $\beta \in \Gamma^*$ be such that

$(\hat{q}, b_t, \text{MINSTK}([q, \gamma])) \mapsto^* (p, \varepsilon, \text{MINSTK}([q, \gamma])\beta)$,   (5.2)

where $[p, \text{MINSTK}([q, \gamma])\beta]$ is an I/O configuration. (Such $p$ and $\beta$ must exist, by the definition
of a predicting DPDA.) Thus

$(q, \sigma_1 b_1 \sigma_2 b_2 \ldots \sigma_t b_t, \gamma) \mapsto^* (p, \varepsilon, \text{MINSTK}([q, \gamma])\beta)$.

Since $[q, \gamma]$ is a home configuration for $L$, it is an I/O configuration that is exactly correct
for $L$. Thus, by Lemma 5.5.5, $[p, \text{MINSTK}([q, \gamma])\beta]$ is also exactly correct for $L$. By the
definition of MINSTK($[q, \gamma]$), if $M$ is started in configuration $[p, \text{MINSTK}([q, \gamma])\beta]$ and given
input $PRES_M(L, \alpha)$, for any sequence $\alpha$, at most only the top element of MINSTK($[q, \gamma]$) (i.e.,
TOP(MINSTK($[q, \gamma]$))) will ever be scanned by $M$, and it will never be removed from the stack.
The stack elements beneath TOP(MINSTK($[q, \gamma]$)) will never be scanned nor removed from the
stack. Thus if $M$ is started in the configuration $[p, \text{TOP}(\text{MINSTK}([q, \gamma]))\beta]$ and shown any
presentation $PRES_M(L, \alpha)$, it will always produce the same outputs as if it had been started
in configuration $[p, \text{MINSTK}([q, \gamma])\beta]$. Hence the configuration $[p, \text{TOP}(\text{MINSTK}([q, \gamma]))\beta]$ is also
exactly correct for $L$. Furthermore, by (5.2),

$(\hat{q}, b_t, \text{TOP}(\text{MINSTK}([q, \gamma]))) \mapsto^* (p, \varepsilon, \text{TOP}(\text{MINSTK}([q, \gamma]))\beta)$.

Since $[p, \text{MINSTK}([q, \gamma])\beta]$ is an I/O configuration, the pair $(p, \text{TOP}(\text{MINSTK}([q, \gamma])\beta))$ is in
$O$, and thus $(p, \text{TOP}(\text{TOP}(\text{MINSTK}([q, \gamma]))\beta))$ is also in $O$, so $[p, \text{TOP}(\text{MINSTK}([q, \gamma]))\beta]$ is also an
I/O configuration. The claim is then proved by setting $Z = \text{TOP}(\text{MINSTK}([q, \gamma]))$ and noting
that $b_t = \text{"r"}$. □

A similar result can be shown for the subclass $C_2$.

Claim 5.5.7 For each $L \in C_2$, there exists a state $q$, a character $Z \in \Gamma$, and a string $\beta \in \Gamma^+$
such that

$(q, r, Z) \mapsto^* (p, \varepsilon, \beta)$,   (5.3)

where $[p, \beta]$ is an I/O configuration that is exactly correct for $L$.

Proof of Claim: Suppose that $L \in C_2$. Let $[q, \gamma] \in HC_L$. By Lemma 5.5.3, some such
home configuration exists for $L$, and by Lemma 5.5.4 it is not a home configuration for any
other language in $C$. By the definition of $C_2$, there must exist some sequence $\sigma = \sigma_1, \sigma_2, \ldots, \sigma_t$
such that for some state $\hat{q}$,

$(q, \sigma_1 b_1 \sigma_2 b_2 \ldots \sigma_t, \gamma) \mapsto^* (\hat{q}, \varepsilon, \text{BOTTOM}(\gamma))$,

where $\sigma_1 b_1 \sigma_2 b_2 \ldots \sigma_t$ is a prefix of $PRES_M(L, \sigma)$. That is, there is some sequence $\sigma$ such that
after reading $\sigma_1 b_1 \sigma_2 b_2 \ldots \sigma_t$ the stack contents of $M$ is exactly BOTTOM($\gamma$). (Such $\sigma$ must exist
since it must be possible to empty $M$'s stack with legal input.) We can assert that this situation
is reached after reading the last character of $\sigma_t$, rather than after some $b_j$, since if the latter were
the case we could define $\sigma_{j+1}$ to be the empty string. Note that since $[q, \gamma] \in HC_L$, $b_t = \text{"r"}$.
Let $p \in Q$ and $\beta \in \Gamma^+$ be such that

$(\hat{q}, b_t, \text{BOTTOM}(\gamma)) \mapsto^* (p, \varepsilon, \beta)$,

where $[p, \beta]$ is an I/O configuration. (Such $p$ and $\beta$ must exist, by the definition of a predicting
DPDA.) Thus

$(q, \sigma_1 b_1 \sigma_2 b_2 \ldots \sigma_t b_t, \gamma) \mapsto^* (p, \varepsilon, \beta)$.

Since $[q, \gamma]$ is a home configuration for $L$, it is an I/O configuration that is exactly correct for
$L$. Thus, by Lemma 5.5.5, $[p, \beta]$ is also exactly correct for $L$. If we set $Z = \text{BOTTOM}(\gamma)$ and
observe that $b_t = \text{"r"}$, the claim is proved. □

By Claims 5.5.6 and 5.5.7, for each $L \in C$ there exists a state $q$, a character $Z \in \Gamma$, and a
string $A \in \Gamma^+$ such that

$(q, r, Z) \mapsto^* (p, \varepsilon, A)$,

where $[p, A]$ is an I/O configuration that is exactly correct for $L$. For each such ID $(q, r, Z)$ there
is at most one configuration $[p, A]$ for which this is true, since $M$ is deterministic. The number
of possible IDs of the form $(q, r, Z)$ is at most $|Q||\Gamma|$. Since, by Lemma 5.5.4, no configuration
is exactly correct for more than one language in $C$, the number of languages in $C$ can be no
more than $|Q||\Gamma|$, completing the proof of Theorem 5.5.1. □
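
The counting step that closes the proof can be restated as an injectivity argument. The map $f$ below is a name introduced here only for illustration; it sends each language to a pair $(q, Z)$ of the kind supplied by Claims 5.5.6 and 5.5.7.

```latex
% f is notation assumed for this sketch; it does not appear in the proof itself.
\begin{align*}
  f &: C \to Q \times \Gamma, \qquad
      f(L) = (q, Z) \ \text{where}\ (q, r, Z) \mapsto^{*} (p, \varepsilon, A)
      \ \text{and}\ [p, A] \ \text{is exactly correct for}\ L, \\
  % Determinism: (q, Z) fixes [p, A].  Lemma 5.5.4: [p, A] fixes L.
  f(L_1) = f(L_2) &\;\Longrightarrow\; L_1 = L_2, \\
  |C| &\le |Q \times \Gamma| = |Q|\,|\Gamma| .
\end{align*}
```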

5.6 Languages Predictable by DPDAs 

In this section we exactly characterize the classes of languages that can be predicted by DPDAs.

Theorem 5.6.1 The DPDA-predictable classes of languages are exactly the finite classes of
deterministic context-free languages.

Proof: We first prove that any finite class of deterministic context-free languages is
predictable.

Lemma 5.6.2 Let $C = \{L_1, L_2, \ldots, L_c\}$ be a finite class of deterministic context-free
languages. There exists a predicting DPDA $M$ for the class $C$.

For each $L_i$, let $M_i$ be a DPDA that accepts $L_i$ by final state. As in the proof of Lemma 5.3.2,
we assume that each $M_i$ accepts all strings in $L_i$ and rejects all strings in $\Sigma^* - L_i$. A similar
result can be shown if we allow each language $L_i$ to be defined over its own alphabet $\Sigma_i$, and
assume that $M_i$ accepts all strings in $L_i$ and rejects all strings in $\Sigma_i^* - L_i$. For clarity of
presentation we prove the result as stated in the lemma.

Proof: The lemma is proved by constructing a predicting DPDA $M$ for $C$, using the
accepting DPDAs. $M$ first simulates $M_1$ and makes all of its guesses based on whether the input
strings are in $L_1$. If $M$ makes an incorrect guess it then starts simulating $M_2$ and makes its
guesses based on the language $L_2$. This continues until $M$ finds the right language, after which
point all of its guesses will be correct. We introduce the additional states $q^1_{right}, q^2_{right}, \ldots, q^c_{right}$
and $q^1_{wrong}, q^2_{wrong}, \ldots, q^{c-1}_{wrong}$ in order to keep track of which accepting automaton is being
simulated and whether the most recent guess was right or wrong.

For each $i$ such that $1 \le i \le c$, let $M_i = (Q_i, \Sigma, \Gamma_i, \delta_i, q^i_1, Z^i_1, F_i)$ be a deterministic PDA that
accepts $L_i$ by final state, where the set of states is $Q_i = \{q^i_1, q^i_2, \ldots, q^i_{|Q_i|}\}$, the stack alphabet
is $\Gamma_i = \{Z^i_1, Z^i_2, \ldots, Z^i_{|\Gamma_i|}\}$, and the set of accepting states is $F_i$. Without loss of generality, we
can assume that no $\varepsilon$-moves are possible from any state in $F_i$ and that $M_i$ reads its entire input
[40].

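We define the predicting DPDA $M = (Q, \Sigma \cup \{r, w\}, \Gamma, \Delta, \delta, \lambda, q_{start}, Z_{bottom})$ as follows. As
for all predicting DPDAs, $r, w \notin \Sigma$ and $\Delta = \{+, -\}$. Let

$Q = \bigcup_{i=1}^{c} Q_i \cup \{q_{start}, q^1_{right}, q^2_{right}, \ldots, q^c_{right}, q^1_{wrong}, q^2_{wrong}, \ldots, q^{c-1}_{wrong}\}$

and $\Gamma = \bigcup_{i=1}^{c} \Gamma_i \cup \{Z_{bottom}\}$. The function $\lambda$ is defined as follows. For any state $q^i_j \in Q_i$,

$\lambda(q^i_j) = \text{"+"}$ if $q^i_j \in F_i$, and $\lambda(q^i_j) = \text{"-"}$ otherwise.

We define $\lambda(q) = \text{"-"}$ for any state $q$ in

$\{q_{start}, q^1_{right}, \ldots, q^c_{right}, q^1_{wrong}, \ldots, q^{c-1}_{wrong}\}$.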

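All of the transitions in $\delta_1, \delta_2, \ldots, \delta_c$ are also in $\delta$. In addition, $\delta$ contains the following
transitions.

1. $\delta(q_{start}, \varepsilon, Z_{bottom}) = \{(q^1_1, Z_{bottom} Z^1_1)\}$.

2. For every $i$ such that $1 \le i \le c$, for every state $q^i_j \in Q_i$, and for every $Z \in \Gamma_i$,
$\delta(q^i_j, r, Z) = \{(q^i_{right}, \varepsilon)\}$ and $\delta(q^i_j, r, Z_{bottom}) = \{(q^i_1, Z_{bottom} Z^i_1)\}$.

3. For every $i$ such that $1 \le i \le c - 1$, for every state $q^i_j \in Q_i$, and for every $Z \in \Gamma_i$,
$\delta(q^i_j, w, Z) = \{(q^i_{wrong}, \varepsilon)\}$ and $\delta(q^i_j, w, Z_{bottom}) = \{(q^{i+1}_1, Z_{bottom} Z^{i+1}_1)\}$.

4. For every $i$ such that $1 \le i \le c$ and for every $Z \in \Gamma_i$, $\delta(q^i_{right}, \varepsilon, Z) = \{(q^i_{right}, \varepsilon)\}$ and
$\delta(q^i_{right}, \varepsilon, Z_{bottom}) = \{(q^i_1, Z_{bottom} Z^i_1)\}$.

5. For every $i$ such that $1 \le i \le c - 1$ and for every $Z \in \Gamma_i$, $\delta(q^i_{wrong}, \varepsilon, Z) = \{(q^i_{wrong}, \varepsilon)\}$
and $\delta(q^i_{wrong}, \varepsilon, Z_{bottom}) = \{(q^{i+1}_1, Z_{bottom} Z^{i+1}_1)\}$.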

Let $L_i$ be any language in $C$. We prove that $M$ is correct for $L_i$; it then follows immediately
that $M$ is a predicting DPDA for $C$. Let $\sigma = \sigma_1, \sigma_2, \ldots$ be any infinite sequence of finite strings
over $\Sigma$. Consider the string $PRES_M(L_i, \sigma) = \sigma_1 b_1 \sigma_2 b_2 \cdots$.

Claim 5.6.3 The number of $w$'s in the sequence $b_1, b_2, \ldots$ is less than $i$.

Proof of Claim: Suppose that $b_m$ is the $(i - 1)$st occurrence of $w$ in $b_1, b_2, \ldots$. We show that
no more $w$'s will appear. Note that, by the definition of the transition function $\delta$, all states
entered by $M$ between the $s$th and $(s + 1)$st appearances of $w$ in $PRES_M(L_i, \sigma)$ are states
from $Q_{s+1}$, and thus that immediately after reading the $s$th occurrence of $w$ and exhausting all
possible $\varepsilon$-moves $M$ is in state $q^{s+1}_1$. Similarly, as soon as the $s$th $w$ in the input is read all
stack symbols are popped and only symbols in $\Gamma_{s+1} \cup \{Z_{bottom}\}$ are pushed, until such time
as another $w$ is encountered. During the entire computation $Z_{bottom}$ appears only once in the
stack, at the bottom. Thus immediately after reading $\sigma_1 b_1 \sigma_2 b_2 \ldots \sigma_m b_m$ and exhausting all
possible $\varepsilon$-moves, $M$ is in state $q^i_1$ and the stack contents is $Z_{bottom} Z^i_1$. Suppose that the string
$\sigma_{m+1}$ is in $L_i$. Then, since $q^i_1$ is the start state and $Z^i_1$ the start symbol of $M_i$, which accepts
$L_i$, and since all of the transitions in $\delta_i$ are in $\delta$,

$(q^i_1, \sigma_{m+1}, Z_{bottom} Z^i_1) \mapsto^* (q, \varepsilon, \gamma)$

for some $q \in F_i$ and $\gamma \in Z_{bottom}\Gamma_i^+$. By assumption, no $\varepsilon$-moves are possible in $M_i$ from any
state in $F_i$; thus $(q, \text{TOP}(\gamma)) \in O$. By the definition of $\lambda$, $\lambda(q) = \text{"+"}$. Since $\sigma_{m+1} \in L_i$, the
value of $b_{m+1}$ will be "r". Similarly, suppose that $\sigma_{m+1} \notin L_i$. Since $M_i$ reads all of its input
there is some $q \in Q_i$ and $\gamma \in Z_{bottom}\Gamma_i^+$ such that $(q, \text{TOP}(\gamma)) \in O$ and

$(q^i_1, \sigma_{m+1}, Z_{bottom} Z^i_1) \mapsto^* (q, \varepsilon, \gamma)$.

Since $\sigma_{m+1} \notin L_i$ and since $M_i$ accepts $L_i$, $q \notin F_i$. Thus $\lambda(q) = \text{"-"}$, and the value of $b_{m+1}$
will again be "r". Note that in either case, by the definition of $\delta$, $M$ will be in state $q^i_1$ with
stack contents $Z_{bottom} Z^i_1$ immediately before the first character of $\sigma_{m+2}$ is read. An obvious
induction shows that each of $b_{m+1}, b_{m+2}, \ldots$ is an "r", proving the claim. □

Hence only a finite number of the $b_k$'s are $w$'s, so there exists some $k_0$ such that for all
$k > k_0$, $b_k = \text{"r"}$. Thus $M$ is correct on $\langle L_i, \sigma \rangle$. Since $\sigma$ was chosen arbitrarily, $M$ is correct
for $L_i$, and the lemma is proved. □

Note that the proof above gives a technique for constructing a predicting DPDA for any
DPDA-predictable class $C$ that makes at most $|C| - 1$ incorrect predictions.
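
The strategy behind this bound can be illustrated without building any automata. The sketch below is an abstraction made here for illustration: each language is replaced by a membership predicate, and the function name predict_by_enumeration and the three example languages are invented for the example. It tries the candidate languages in order and advances to the next one after each wrong guess.

```python
def predict_by_enumeration(languages, target, strings):
    """Guess membership using candidates in order; advance after each 'w'.

    languages: list of membership predicates, one per language in the class
    target:    index of the language actually being presented
    strings:   the finite sequence of strings shown to the predictor
    Returns (feedback characters, number of incorrect guesses).
    """
    current, mistakes, feedback = 0, 0, []
    for s in strings:
        if languages[current](s) == languages[target](s):
            feedback.append("r")
        else:
            feedback.append("w")
            mistakes += 1
            # Once current == target every guess is right, so current
            # never advances past target and stays within the list.
            current += 1
    return feedback, mistakes


# Hypothetical class of three languages over {a, b}.
languages = [
    lambda s: len(s) % 2 == 0,    # strings of even length
    lambda s: s.startswith("a"),  # strings that begin with a
    lambda s: "b" in s,           # strings that contain b
]
feedback, mistakes = predict_by_enumeration(languages, 2, ["ab", "b", "a", "bb", ""])
```

Here the third language is presented, so the predictor can be wrong at most twice before it settles on the right candidate, in line with the $|C| - 1$ bound.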

It remains to be shown that every DPDA-predictable class is a finite class of deterministic 
context-free languages. Lemma 5.6.4 states that all languages in a DPDA-predictable class 
must be deterministic CFLs. 

Lemma 5.6.4 Let $C$ be a DPDA-predictable class, and let $L$ be a language in $C$. Then $L$ is a
deterministic context-free language.

Proof: We prove the lemma by constructing a DPDA $M'$ that accepts $L$ from a predicting
DPDA $M$ for the class $C$. $M'$ first enters a home configuration for $L$, and then simulates $M$.
It accepts or rejects the input string based on the guess output by $M$.

Let $M = (Q, \Sigma \cup \{r, w\}, \Gamma, \Delta, \delta, \lambda, q_1, Z_1)$ be a predicting DPDA for $C$, with $Q = \{q_1, \ldots, q_{|Q|}\}$
and $\Gamma = \{Z_1, Z_2, \ldots, Z_{|\Gamma|}\}$. We define a DPDA $M'$ that accepts $L$. Let $[q_k, \gamma_k]$ be a home
configuration for $L$; at least one such configuration exists by Lemma 5.5.3. Then define
$M' = (Q \cup \{q_{start}\}, \Sigma, \Gamma \cup \{Z_{start}\}, \delta', q_{start}, Z_{start}, F)$, where the set of accepting states $F$
is defined by

$F = \{q \in Q : \lambda(q) = \text{"+"}\}$.

The transition function $\delta'$ contains all transitions of $\delta$ that don't involve the symbols $r$
or $w$, as well as the transition $\delta'(q_{start}, \varepsilon, Z_{start}) = \{(q_k, \gamma_k)\}$. Thus at the beginning of any
computation, $M'$ enters $[q_k, \gamma_k]$, a home configuration for $L$. It then simulates $M$ on the input
string (in $\Sigma^*$). To see that $M'$ accepts $L$, let $x$ be any string in $\Sigma^*$. Let $q \in Q$ and $\gamma \in \Gamma^+$ be
such that $(q, \text{TOP}(\gamma)) \in O$ and

$(q_k, x, \gamma_k) \mapsto^* (q, \varepsilon, \gamma)$.

Since $[q_k, \gamma_k]$ is a home configuration for $L$, it is exactly correct for $L$. If $x \in L$, then by the
definition of exact correctness, $\lambda(q) = \text{"+"}$, so $q \in F$. If $x \notin L$ then $\lambda(q) = \text{"-"}$, so $q \notin F$.
Thus $M'$ accepts $L$, so $L$ is a deterministic CFL. □

Any predicting DPDA must have a finite number of states and finite stack and input
alphabets; thus by the bound of Theorem 5.5.1, the number of languages in a DPDA-predictable
class must be finite. Hence the only DPDA-predictable classes are the finite classes of
deterministic context-free languages. This concludes the proof of Theorem 5.6.1. □

5.7 Prediction Using Counter Machines

An interesting special case of a deterministic PDA is a 1-counter machine (1CM). A 1CM is
a DPDA with only two stack symbols, 0 and 1. Furthermore, the symbol 0 is used only as a
bottom-of-stack marker; it always appears exactly once on the stack, at the bottom. Thus the
stack functions as a counter: it stores a nonnegative integer, corresponding to the number of 1's
on the stack. A counter machine, like any DPDA, makes transitions based on the current state,
current input symbol (it can also make $\varepsilon$-moves), and the symbol on top of the stack. In the
case of a 1CM, the latter is equivalent to checking whether the number stored in the counter
is zero or positive. Similarly, a k-counter machine (kCM) can be defined for any nonnegative
integer k. Such a machine has k stacks as described above, each of which functions as a
counter. The transitions in a k-counter machine depend on the state, input symbol, and which
of the counters store positive numbers. Note that a 0-counter machine is a DFA. It has been
shown that any Turing machine can be simulated by a 2-counter machine [40, 56]. For any k,
a k-counter language (kCL) is a language that is accepted by some k-counter machine.
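
As a concrete illustration of a 1-counter language, the following sketch simulates a 1-counter machine for $\{a^n b^n : n \ge 0\}$. The example language, the state names, and the function name are choices made here and do not come from the text; the integer counter stands for the number of 1's above the bottom marker 0, and the machine only ever tests whether it is zero or positive.

```python
def accepts_anbn(x: str) -> bool:
    """Simulate a 1-counter machine for the language {a^n b^n : n >= 0}."""
    state, counter = "A", 0  # counter = number of 1's above the marker 0
    for ch in x:
        if state == "A" and ch == "a":
            counter += 1                       # push a 1
        elif state in ("A", "B") and ch == "b" and counter > 0:
            state, counter = "B", counter - 1  # pop a 1
        else:
            return False                       # no transition: reject
    # Zero test on the top of the stack: accept only if just the marker
    # remains (in a DPDA, an epsilon-move to an accepting state on top 0).
    return counter == 0
```

The only facts about the counter that the control uses are whether it is zero or positive, which is exactly the restriction described above.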

By making the necessary adjustments to the definition of predicting DPDAs, we can define
a predicting 1-counter machine in the obvious way. Since it is a straightforward restriction
of the general DPDA definition, a formal definition is omitted. Similarly, the definition of a
class of languages that is 1-counter predictable, as well as other related definitions, is exactly
analogous to the definition for DPDA-predictability, and thus omitted.

Theorem 5.7.1 The 1CM-predictable classes of languages are exactly the finite classes of
1-counter languages.

Proof Sketch: Since 1-counter machines are a restriction of DPDAs, the result of
Theorem 5.5.1 implies that only finite classes are 1CM-predictable. By an argument similar to the
one given in the proof of Lemma 5.6.4, all languages in any 1CM-predictable class are 1-counter
languages. The following lemma states that any finite class $C$ of 1CLs is 1CM-predictable,
completing the proof. □



Lemma 5.7.2 Let $C = \{L_1, L_2, \ldots, L_{|C|}\}$ be a finite class of 1-counter languages. There exists
a predicting 1CM $M$ for the class $C$.

Proof Sketch: The proof is similar to the proof of Lemma 5.6.2. By arguments similar to
some in [40], we can assume that for each $L_i$ there is a 1CM $M_i$ that accepts $L_i$, that reads all
of its input, and such that no $\varepsilon$-moves are possible from any accepting state. We construct a
predicting 1CM $M$ for $C$, using the accepting 1-counter machines. $M$ first simulates $M_1$ and
makes all of its guesses based on whether the input strings are in $L_1$. If $M$ makes an incorrect
guess, it then starts simulating $M_2$, and makes its guesses based on the language $L_2$. This
continues until $M$ finds the right language, after which point all of its guesses will be correct.
As in the proof of Lemma 5.6.2, extra states are used to keep track of which machine is being
simulated and whether the last guess was right or wrong. The bottom-of-stack symbol 0 in the
counter machine takes the place of the stack symbol $Z_{bottom}$ in that proof. After $M$ outputs a
guess and reads the character indicating whether or not it guessed correctly, it pops all 1's off
the stack until just the zero remains, using $\varepsilon$-moves. The current state contains the information
as to which automaton has just been simulated, as well as whether the guess just made was
correct. Using this information, $M$ will either simulate the same machine or move on to the
machine for the next language in the class. □

5.8 Discussion

It is perhaps not surprising that only finite classes can be predicted by deterministic finite
automata in this model; after all, there is no infinite component in a DFA. It is, however, much
less intuitive that 1-counter machines, and even deterministic pushdown automata, which have
stacks that are allowed to grow without bound, are unable to predict any infinite classes of
languages (not even an infinite class of singleton languages). In fact, even though a predicting
DPDA can make use of such a stack, the size of the classes that can be predicted by DPDAs
only exceeds the size of the classes predictable by DFAs (with the same number of states)
by a factor of about $2|\Gamma|$. Thus, although the stack is useful for allowing the prediction of
classes containing more complex languages, it is much less effective at enabling the automaton
to predict larger classes. The additional number of languages that can be predicted by DPDAs
can largely be explained by the availability of only the top-of-stack symbol, which effectively
increases the number of states in a DFA by a factor of $|\Gamma|$.

It is interesting to note the hierarchy of predictive power for counter machines. The classes
that can be predicted by 0-counter machines (DFAs) are the finite classes of regular languages.
Similarly, 1-counter machines can predict exactly the finite classes of 1-counter languages.
However, when the number of counters reaches two, the number of predictable classes grows
considerably. As was mentioned above, 2-counter machines are as powerful as Turing machines.
Thus prediction by 2CMs in this model is equivalent to NV-extrapolation [6, 9, 19]. A result in
[7] shows that any recursively enumerable class of recursive functions (as well as any subclass
of such a class) can be NV-extrapolated. Thus, although the difference in predictive power
between 0- and 1-counter machines is relatively slight, an enormous difference exists between
the ability of 1- and 2-counter machines to predict classes of languages.

6 ONLINE ALGORITHMS FOR VERTEX LABELING
PROBLEMS



An online algorithm is an algorithm that is given a series of discrete inputs, and must make some
irrevocable decision after seeing each input. An online graph algorithm is an online algorithm
in which the inputs are pieces of a graph and the decisions are (usually) determining what label
to assign to a vertex. At least two online graph problems have been studied in some detail.
In [33] and [51] online algorithms for coloring the vertices of a graph are considered. The
problem of constructing chain covers and antichain covers of partially ordered sets online has
been studied in [49] and [50]. Online algorithms for a variety of other problems, such
as packing problems [16, 77], dynamic storage allocation (e.g. [23]), and metrical task systems,
including server and caching problems [15, 21, 55, 65], have also been investigated. Work done
on recursively colorable infinite graphs [8, 18, 29] is related to online graph algorithms. Update
algorithms, in which graph properties are updated following incremental changes to the graph,
also have much in common with online graph algorithms [26, 27, 41, 42, 68].

The problems considered here are a class of graph problems that we refer to as vertex labeling
problems. In these problems, the objective is to assign labels to the vertices of a graph such
that the labeling satisfies certain properties. A particular labeling is evaluated according to
some criterion, such as the number of different labels used (as in the vertex coloring problem)
or the number of vertices to which a particular label is assigned (as in the dominating set
problem). The goal is to find an algorithm that always produces a good labeling according to
the criterion. Vertex labeling problems easily lend themselves to an online protocol; at each
stage, an online algorithm must make an irrevocable decision as to what label a particular
vertex should be assigned. For many such problems, however, it is unrealistic to expect that an
optimal labeling can always be found online; a more reasonable approach is to search for online
algorithms whose worst-case performance is always bounded by some function of the optimal
labeling and perhaps some other parameters of the graph.

The protocol most often used for online graph algorithms is as follows. The algorithm A
is given input, and must produce output, in n stages, where n is the number of vertices in
the graph G. At the ith stage, A is told which of the vertices $v_1, v_2, \ldots, v_{i-1}$ the vertex $v_i$
is adjacent to. A must then output the label to be assigned to $v_i$. Thus the algorithm must
make an irrevocable decision about the label of $v_i$ having seen only the subgraph of G induced
by $\{v_1, v_2, \ldots, v_i\}$. Note that no restrictions are placed on how much time A is allowed to use
before making its decision. The performance of A is measured by how good the labelings it
outputs are relative to the best possible (offline) labeling.

This protocol is too restrictive for any algorithm operating under it to achieve good results
for any of a number of vertex labeling problems, including the independent set, vertex cover,
and dominating set problems. For each of these problems, it is trivial to establish upper
bounds on the (worst-case) performance of such algorithms that are little better than the worst
performance level possible. In order to achieve any reasonable performance guarantees for these
problems it is necessary to remove some of the restrictions that are placed on algorithms by
this protocol. For this reason, we define two new online protocols for vertex labeling problems
that will permit us to study how well local heuristics work for these problems.

We will refer to the standard online protocol described above as Protocol 1. The first of
the new protocols, which we call Protocol 2, is as follows. An online algorithm A operates in
the same manner as a Protocol 1 algorithm, with the following exception. At the ith stage,
rather than being given as input a list of the vertices in $\{v_1, v_2, \ldots, v_{i-1}\}$ that are adjacent to
the vertex $v_i$, A is given as input a list of all of the vertices in the graph that are adjacent to $v_i$.
Thus at each stage A has more information available to it than merely the subgraph induced
by the set of vertices that have already been labeled.

The other new protocol, referred to as Protocol 3, allows an algorithm A to have the same
information for each vertex that it is permitted under Protocol 2, and in addition allows A
to select at each stage the vertex that it would like to label next. Since (unlike the case
in Protocol 1) A may have information about vertices not yet labeled, this is a meaningful
difference. (It can easily be shown that allowing A to choose the next vertex would be of no
advantage if, as in Protocol 1, A only had information about the vertices it had already labeled.)
As was the case with the original protocol, the performance of an algorithm operating under
one of these new protocols is measured by the quality of the labelings it outputs relative to the
best possible (offline) labeling.
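
The difference between what an algorithm is shown under Protocol 1 and under Protocol 2 can be seen by tabulating both inputs for a fixed presentation order. This is a minimal sketch; the function name `protocol_inputs` is an assumption made for illustration, and Protocol 3 (where the algorithm also picks the order) is not modeled.

```python
def protocol_inputs(order, edges):
    """For each vertex, return (vertex, Protocol 1 input, Protocol 2 input).

    Protocol 1 input: neighbours among the vertices presented earlier.
    Protocol 2 input: all neighbours in the graph.
    """
    adjacency = {v: set() for v in order}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    rows = []
    for v in order:
        rows.append((v, adjacency[v] & seen, set(adjacency[v])))
        seen.add(v)
    return rows
```

For the path a-b-c presented in that order, vertex a is shown nothing under Protocol 1 but is shown its neighbour b under Protocol 2.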

One reason that research on online algorithms is interesting is that it may offer hints as to
the limits of what can be achieved by algorithms that use only local heuristics (i.e., algorithms
that are strictly online or else use only a modest amount of lookahead), as opposed to global
ones. For some graph problems in P (e.g., finding a minimum-weight spanning tree) there are
efficient algorithms that use only local heuristics. Because local heuristics can
usually be implemented efficiently, they are often used to try to find approximate solutions
to problems for which finding the optimal solution is hard. By considering online algorithms
for NP-complete problems we can study how well local heuristics work for (apparently) more
difficult problems. Recall that there are no restrictions on the amount of time and space that
an online algorithm is permitted to use. Thus studying the performance of online algorithms
for NP-complete problems may lead to a better understanding as to what extent, if at all,
additional computational resources can compensate for having only local knowledge of a graph.

In the remainder of this chapter we investigate how well online algorithms perform on 
several vertex labeling problems. First, we consider online algorithms for the graph bandwidth 
problem. Next, we look at online algorithms for several problems that are a particular type of 
vertex labeling problem that we call vertex subset problems. These include the independent 
set, vertex cover, and dominating set problems. 

6.1 The Online Graph Bandwidth Problem 

In this section we investigate the performance of online algorithms for the graph bandwidth 
problem. 

The study of bandwidths originally arose in connection with matrices, but was readily recast
as a problem in graph theory. The problem of finding the bandwidth of a graph is to determine
the smallest possible value k such that there exists a bijective function f from the vertex set
V to the set $\{1, 2, \ldots, |V|\}$ with the property that if two vertices have an edge between them
then the difference of their images under f is no more than k. The problem of determining the
bandwidth of an arbitrary graph is known to be NP-complete [59]. See [20, 22, 76] for further
results on the graph bandwidth parameter and its extensions. Graph bandwidths also arise in
the study of VLSI circuit design.

We are interested in the problem of finding online algorithms that construct a function f
with as small a bandwidth as possible for arbitrary graphs. It is not possible to always find
the minimum possible bandwidth online; thus we try to find a function with a bandwidth not
too much larger than the minimum. No restrictions are placed on the computational resources
(time and space) available to the algorithms. We do not consider infinite graphs.

An application of this particular problem is as follows. Suppose we receive some data files in
a sequential manner, and must write each file onto a sequential tape as it arrives. The files can
be placed anywhere on the tape, but we want them positioned so as to minimize the longest
distance that the tape head must travel between files when the data files are subsequently
accessed. If the pattern of anticipated data accesses is such that it can be modeled by a graph,
then the problem of deciding where to put each file as it arrives can be modeled by an online
graph bandwidth problem.

Turner [72] also studied approximation algorithms for the graph bandwidth problem. However,
he does not consider online algorithms, and he assumes an underlying probability distribution
over the possible graphs and uses an average-case performance analysis. We analyze
online algorithms in terms of their worst-case performance.

The outline of this section is as follows. We first present an online algorithm (that operates
under Protocol 1) for the bandwidth problem and demonstrate that its performance is close to
optimal. We then define the two new, less restrictive protocols for online graph bandwidth
algorithms, and prove lower bounds on the bandwidth of the function constructed by any
algorithm that operates according to these protocols.

6.1.1 Notation and Definitions

Let G be a simple finite undirected graph with vertex set $V = \{v_1, v_2, \ldots, v_n\}$ and edge set E.
Note that $|V| = n$. If $(u, v) \in E$ then u and v are adjacent. For any $v_i \in V$ we define the
adjacency list for $v_i$ as $Adj(v_i) = \{u : (v_i, u) \in E\}$. We define the restricted adjacency list for
$v_i$ as $Adj_i(v_i) = Adj(v_i) \cap \{v_1, v_2, \ldots, v_{i-1}\}$.

Definition 6.1.1 For any integer m, $1 \le m \le n - 1$, an m-bandwidth function for a graph
G is a bijective function $f : V \to \{1, 2, \ldots, n\}$ with the property that for any edge $(u, v)$ in E,
$|f(u) - f(v)| \le m$. Alternatively, a function with this property may be said to have bandwidth m.
If a function f is an m-bandwidth function for some m, then f is a bandwidth function. The
bandwidth of G is the smallest positive integer k such that there exists a k-bandwidth function
for G.
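
To make the definition concrete, the following Python sketch computes the bandwidth of a given labeling and, by exhaustive search, the bandwidth of a small graph. The function names are hypothetical, and the brute-force search over all bijections is practical only for very small graphs (the problem is NP-complete in general).

```python
from itertools import permutations


def labeling_bandwidth(edges, f):
    """Largest |f(u) - f(v)| over all edges (0 if there are no edges)."""
    return max((abs(f[u] - f[v]) for u, v in edges), default=0)


def graph_bandwidth(vertices, edges):
    """Minimum labeling bandwidth over all bijections V -> {1, ..., n}."""
    vertices = list(vertices)
    n = len(vertices)
    best = None
    for perm in permutations(range(1, n + 1)):
        f = dict(zip(vertices, perm))
        b = labeling_bandwidth(edges, f)
        if best is None or b < best:
            best = b
    return best
```

For example, a path on four vertices has bandwidth 1, while a cycle on four vertices has bandwidth 2.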






The size of a graph's bandwidth gives information about how the vertices in the graph are
connected. In a graph with a small bandwidth, the vertices tend to have edges only to vertices
in the same part of the graph, while a graph with a large bandwidth has edges between vertices
in different parts of the graph. Thus the bandwidth measures certain locality properties of the
edge set. Note that if G has bandwidth k, then no vertex in G can have degree greater than
2k. In particular, if G has bandwidth 0, then there are no edges in G, so any bijective function
from V onto $\{1, 2, \ldots, n\}$ is a 0-bandwidth function for G. We will assume that the edge set E is
not empty, and thus G has bandwidth $k \ge 1$.

The problem studied here is the construction of an m-bandwidth function f by an algorithm
A when A is given its inputs, and outputs the values of f, according to an online protocol
(defined below).

Definition 6.1.2 An algorithm A is an online bandwidth algorithm if its input/output behavior
is as follows. Initially A is given as input (for some graph G) the number of vertices n and
the bandwidth k. Then, for some ordering of the vertices, $v_1, v_2, \ldots, v_n$, A is presented the
restricted adjacency lists of the vertices in that order. After the list for $v_i$ is seen, A must
output the value of $f(v_i)$ before it is shown $Adj_{i+1}(v_{i+1})$. The decision made by A as to the
value of $f(v_i)$ is irrevocable. When all of the restricted adjacency lists have been seen by A, it
must have defined the values of f such that f is a bandwidth function for G.

Definition 6.1.3 An online bandwidth algorithm A is an online m-bandwidth algorithm if,
for any n and k, for any graph G with n vertices and bandwidth k, and for any ordering of the
vertices of G, the function f defined by A is an m-bandwidth function.

Note that it is trivial to find an online $(n - 1)$-bandwidth algorithm. In fact, any algorithm
that produces a bijective function from V onto $\{1, 2, \ldots, n\}$ (according to the online protocol)
is an online $(n - 1)$-bandwidth algorithm.

This definition of an online algorithm is generally similar to the method of presenting graphs
and partially ordered sets used in other work on online graph algorithms. One difference between
our definition and the protocols used in the problems of online graph coloring and recursively
covering posets with chains/antichains is that we allow the algorithm to know the number of
vertices in the graph. In the coloring and poset problems, the objective is to construct a function
with domain V and a range as small as possible, provided that it satisfies certain constraints.
In the bandwidth problem, however, the range of the function must be the same size as V; thus
any online algorithm would be severely handicapped if it did not know what the range was
required to be. Note that we also permit an online bandwidth algorithm to know in advance
the actual bandwidth of the graph. Because of the stringent requirements on the algorithm (i.e.
that it must construct a bijective function with the desired properties based on only partial
information about the graph) we feel that it is not unreasonable to provide the algorithm with
this information.

6.1.2 An Online Algorithm for Finding the Bandwidth of a Graph

Theorem 6.1.4 There exists a Protocol 1 online $\frac{(2k-1)n+1}{2k}$-bandwidth algorithm.

Note that an alternative way to phrase the problem and the above result is as follows. The
definition of an online bandwidth algorithm could be changed to drop the condition that the
algorithm be given the value of k. Then the above theorem could state that for any n and k,
there is an online $\frac{(2k-1)n+1}{2k}$-bandwidth algorithm for the set of all graphs with n vertices and
bandwidth k.

Proof: Define $B(n,k) = \frac{(2k-1)n+1}{2k}$. The online algorithm OLBW (Figure 6.1) computes a
$B(n,k)$-bandwidth function f.

Note that the algorithm OLBW sets $f(v_i)$ equal to the unused value in $\{1, 2, \ldots, n\}$ furthest
from $\mu$ that is still consistent with an eventual online bandwidth of $B(n,k)$. Here $\mu = \frac{n+1}{2}$
is the "middle" of $1, 2, \ldots, n$.

Definition 6.1.5 Let $\alpha$ and $\beta$ be elements of $\{1, 2, \ldots, n\}$. $\alpha$ is more extreme than $\beta$ (or $\beta$
is less extreme than $\alpha$) if $|\alpha - \mu| > |\beta - \mu|$. $\alpha$ is at least as extreme as $\beta$ (or $\beta$ is no more
extreme than $\alpha$) if $|\alpha - \mu| \ge |\beta - \mu|$.

In the following we will frequently refer to the assignment of an image under f to a vertex
$v_i$ as "labeling $v_i$" or "giving $v_i$ a label". Similarly, elements in LABELS will be referred
to as "unused labels" or "available labels", while elements of $\{1, 2, \ldots, n\}$ that are no longer in
LABELS will be referred to as "used labels". Thus OLBW sets $f(v_i)$ equal to the most extreme
unused label that is consistent with f having a bandwidth of at most $B(n,k)$.

f is well-defined and bijective. To show that f has bandwidth $B(n,k)$, we will assume
otherwise and show that a contradiction inevitably arises.

Algorithm OLBW

1. Set LABELS = $\{1, 2, \ldots, n\}$.

2. Set $\mu = \frac{n+1}{2}$.

3. For each $i = 1, 2, \ldots, n$ do:

   (i) Define $f(v_i) = z$, where z is the element in LABELS
       that maximizes $|z - \mu|$, subject to the constraint that
       for each $v_j \in Adj_i(v_i)$,

           $|f(v_i) - f(v_j)| \le B(n,k)$.

       In case of ties, choose the smaller value.
   (ii) Set LABELS = LABELS $- \{z\}$.

Figure 6.1: Online algorithm to find a $B(n,k)$-bandwidth function
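
The steps of Figure 6.1 can be sketched in Python as follows. This is an illustrative transcription, not the thesis's own code: it assumes $B(n,k) = \frac{(2k-1)n+1}{2k}$ and $\mu = \frac{n+1}{2}$ (as the algebra in the proof indicates), numbers vertices from 0, and uses the hypothetical names `bound` and `olbw`. Exact rational arithmetic avoids floating-point ties.

```python
from fractions import Fraction


def bound(n, k):
    """B(n, k) = ((2k - 1) n + 1) / (2k), as an exact rational."""
    return Fraction((2 * k - 1) * n + 1, 2 * k)


def olbw(n, k, restricted_lists):
    """Label vertices 0..n-1 online.

    restricted_lists[i] lists the earlier vertices j < i adjacent to i.
    Returns the list f with f[i] the label given to vertex i.
    """
    B = bound(n, k)
    mu = Fraction(n + 1, 2)
    labels = set(range(1, n + 1))       # Step 1: unused labels
    f = []
    for neighbours in restricted_lists:  # Step 3
        feasible = [z for z in labels
                    if all(abs(z - f[j]) <= B for j in neighbours)]
        if not feasible:
            raise ValueError("no label satisfies the bandwidth constraint")
        # Most extreme feasible label; ties go to the smaller value.
        z = min(feasible, key=lambda lab: (-abs(lab - mu), lab))
        f.append(z)
        labels.remove(z)
    return f
```

On a path with six vertices presented end to end (so k = 1 and B(6, 1) = 7/2), the sketch produces the labels 1, 2, 5, 6, 3, 4, whose largest edge difference is 3.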

Suppose that f has a bandwidth greater than $B(n,k)$. Let $v_s$ be the first vertex encountered
by OLBW such that labeling $v_s$ violates the bandwidth constraint; that is, while OLBW is
processing $v_s$, it finds that there is no element in LABELS that satisfies the constraint in Step
3(i) of the algorithm.

Case 0: $Adj_s(v_s) = \emptyset$, i.e. there are no edges in E between $v_s$ and any previously-seen
vertex. Then any label that is given to $v_s$ fails to increase the bandwidth of f. Thus the
constraint of Step 3(i) cannot have been violated by $v_s$ after all, and we get a contradiction.

Case 1: $Adj_s(v_s) = \{u\}$. Thus $v_s$ has an edge to exactly one previously-seen vertex, which
we will call u. Since u was processed before $v_s$, OLBW has already computed $f(u)$.

Fact 6.1.6 For any $k \ge 1$, $n - B(n,k) < B(n,k) + 1$.
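
The fact follows by direct computation. The derivation below is a sketch that assumes $B(n,k) = \frac{(2k-1)n+1}{2k}$, the value implied by the algebra used later in the proof.

```latex
\begin{align*}
n - B(n,k) &= \frac{2kn - (2k-1)n - 1}{2k} = \frac{n-1}{2k}, \\
B(n,k) + 1 &= \frac{(2k-1)n + 1 + 2k}{2k}.
\end{align*}
% For k >= 1 we have (2k-1)n >= n, hence
\[
n - 1 \;<\; n + 1 + 2k \;\le\; (2k-1)n + 1 + 2k ,
\]
% and dividing by 2k gives n - B(n,k) < B(n,k) + 1.
```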

If $n - B(n,k) \le f(u) \le B(n,k) + 1$, then $|f(v_s) - f(u)|$ is at most either $n - (n - B(n,k)) =
B(n,k)$ or $(B(n,k) + 1) - 1 = B(n,k)$. Hence no label that is given to $v_s$ will cause the bandwidth
of f to exceed $B(n,k)$. Thus we need only consider the cases when $f(u) < n - B(n,k)$ or
$f(u) > B(n,k) + 1$.

Case 1-a: $f(u) < n - B(n,k)$. Let t be the largest integer such that all of the labels from
1 to t have already been used; thus $t + 1$ is the smallest unused label. Let $m = t + 1$. Note
that if there were any unused labels between 1 and $B(n,k) + 1$, then $v_s$ could be given one
of those labels. By Fact 6.1.6, $n - B(n,k) < B(n,k) + 1$, so $|f(v_s) - f(u)|$ would be at most
$(B(n,k) + 1) - 1 = B(n,k)$, and the bandwidth of f would not be forced to exceed $B(n,k)$.
Thus we can assume that $t > B(n,k)$, and hence $m \ge B(n,k) + 2$.

For any used label p, let $f^{-1}(p)$ be the vertex that has been assigned the label p by OLBW.

Consider the set $P = \{1, 2, \ldots, t\}$. All of the elements of P are labels that have already
been used.

Definition 6.1.7 We define the sets $P_1$ and $P_2$ as follows.

• $P_1$ is the set of labels p in P such that p is at least as extreme as m.

• $P_2$ is the set of labels p in P such that there is an edge in E from $f^{-1}(p)$ to a vertex that
has already been given a label less than $m - B(n,k)$.

Lemma 6.1.8 $P_1 \cup P_2 = P$.

Proof: Consider any $p \in P$. Let $v_j$ ($j < s$) be the vertex $f^{-1}(p)$. By Step 3(i) of the
algorithm OLBW, p was at that time the most extreme element in LABELS that would not, if
assigned to $v_j$, force the bandwidth of f to exceed $B(n,k)$. Suppose that $p \notin P_1$, so m is more
extreme than p. Then the reason that $v_j$ was given p, rather than m, as a label by OLBW
must have been because assigning m to $v_j$ would make the bandwidth of f too large. Since
$m \ge B(n,k) + 2$, the only way that this could happen would be if there was an edge from $v_j$ to
a vertex that had already been assigned a label smaller than $m - B(n,k)$. Thus $p \in P_2$. □

Lemma 6.1.9 $|P_1| = n - m + 1$. $|P_2| \le 2k(m - B(n,k) - 1)$.

Proof: Since $m \ge B(n,k) + 2$, m is greater than $n/2$. Thus the labels that are at least
as extreme as m are $1, 2, \ldots, n - m + 1$ and $m, m + 1, \ldots, n$. Since $m = t + 1$, the only such
labels that are in P are $1, 2, \ldots, n - m + 1$, proving the first part of the lemma. The number
of vertices that have already been assigned a label smaller than $m - B(n,k)$ is clearly bounded
by $m - B(n,k) - 1$. Since G has bandwidth k, each such vertex can have degree no more than
2k, proving the remainder of the lemma. □

Define $P_{12} = P_1 \cap P_2$. Clearly $|P_{12}| \le |P_1| = n - m + 1 = n - t$. Thus

$$|P_2| \ge t - |P_1| \ge t - (n - t) = 2t - n. \qquad (6.1)$$

By Lemma 6.1.9, since $m = t + 1$,

$$|P_2| \le 2k(m - B(n,k) - 1) = 2km - ((2k-1)n + 1) - 2k = 2kt - (2k-1)n - 1.$$

Since $k \ge 1$ and $t < n$, $2(k-1)t - 1 < 2(k-1)n$. By algebra, $2kt - (2k-1)n - 1 < 2t - n$,
so $|P_2| < 2t - n$, contradicting (6.1). Since we get a contradiction, this case cannot arise.
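
The "by algebra" step above can be spelled out. The following is a sketch of the rearrangement, using only the inequality $2(k-1)t - 1 < 2(k-1)n$ stated in the text.

```latex
\begin{align*}
2(k-1)t - 1 &< 2(k-1)n \\
2kt - 2t - 1 &< 2kn - 2n \\
2kt - 2kn + n - 1 &< 2t - n \\
2kt - (2k-1)n - 1 &< 2t - n .
\end{align*}
% Combined with the upper bound from Lemma 6.1.9, this puts the size of
% P_2 strictly below 2t - n, which is incompatible with (6.1).
```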

Case 1-b: $f(u) > B(n,k) + 1$. Since this case is symmetric to Case 1-a, the exposition
will be shorter. Define t to be minimal such that all of the labels $t, t + 1, \ldots, n$ have already
been used. Let $m = t - 1$, the largest unused label. If there were any unused labels between
$n - B(n,k)$ and n, then $v_s$ could be assigned one of them, without forcing f's bandwidth to
exceed $B(n,k)$, by Fact 6.1.6. Thus assume that $t \le n - B(n,k)$ and $m \le n - B(n,k) - 1$.
Define $P = \{t, t + 1, \ldots, n\}$.

Definition 6.1.10 We define the sets $P_3$ and $P_4$ as follows.

• $P_3$ is the set of labels p in P such that p is at least as extreme as m.

• $P_4$ is the set of labels p in P such that there is an edge in E from $f^{-1}(p)$ to a vertex that
has already been given a label greater than $m + B(n,k)$.



Lemma 6.1.11 $P_3 \cup P_4 = P$.

Proof: Similar to proof of Lemma 6.1.8. □

Lemma 6.1.12 $|P_3| = m$. $|P_4| \le 2k(n - m - B(n,k))$.

Proof: Since $m \le n - B(n,k) - 1 < \frac{n}{2}$, the only labels in P that are at least as extreme as
m are $n - m + 1, n - m + 2, \ldots, n$. There are m of these, proving the first equality. No more
than $n - m - B(n,k)$ vertices can be assigned labels larger than $m + B(n,k)$; each such vertex
has degree no more than 2k. This proves the rest of the lemma. □



Let $P_{34} = P_3 \cap P_4$. Then $|P_{34}| \le |P_3| = m = t - 1$ and

$$|P_4| \ge |P| - |P_3| = n - t + 1 - |P_3|. \qquad (6.2)$$

By Lemma 6.1.12,

$$\begin{aligned}
|P_4| &\le 2k(n - m - B(n,k)) \\
      &= 2kn - 2km - (2k-1)n - 1 \\
      &= n - 2kt + 2k - 1 \\
      &= (n - 2t) - (-2t + 2kt - 2k + 1) \\
      &= (n - 2t) - (2t(k-1) - 2k + 1).
\end{aligned}$$

Since $t \ge 2$ (because not all labels can have been used already) and $k \ge 1$,

$$2t(k-1) - 2k + 1 \ge 4(k-1) - 2k + 1 = 2k - 3 \ge -1.$$

Thus

$$|P_4| \le (n - 2t) - (-1) = n - 2t + 1.$$

Since $|P_{34}| \le |P_3| = t - 1$,

$$|P_4| \le n - 2t + 1 < n - t + 1 - |P_3|,$$

which contradicts (6.2). Hence this case cannot arise.

Case 2: $|Adj_s(v_s)| \ge 2$; $v_s$ has an edge to two or more previously-seen vertices. Let l
and r be the smallest and largest labels, respectively, among all vertices in $Adj_s(v_s)$. (If the
labels $1, 2, \ldots, n$ are thought of as being written in ascending order, then l is the "leftmost",
and r the "rightmost", label of any vertex in $Adj_s(v_s)$.) Note that $r - l < n < 2B(n,k)$, so
$r - B(n,k) < l + B(n,k)$. Any label between $\max\{1, r - B(n,k)\}$ and $\min\{n, l + B(n,k)\}$,
inclusive, is within $B(n,k)$ of both l and r. Thus all such labels must have already been used
since, by hypothesis, any available label that is assigned to $v_s$ causes f's bandwidth to exceed
$B(n,k)$.

We split this case into four subcases. 

Case 2-a: $r - B(n,k) \le 1$ and $l + B(n,k) \ge n$. Thus all of the labels $1, 2, \ldots, n$ have been
used already, so all of the vertices have been labeled. There is no $v_s$ left to label.

Case 2-b: $r - B(n,k) \le 1$ and $l + B(n,k) < n$. Thus all of the labels from 1 through
$l + B(n,k)$ have been used.

Let t be maximal such that all of the labels $1, 2, \ldots, t$ have been used; thus $t \ge B(n,k) + 1$.
If the argument in Case 1-a is repeated using $u \in Adj_s(v_s)$, instead of $Adj_s(v_s) = \{u\}$, then
this situation is seen not to be achievable; hence this case cannot arise.

Case 2-c: $r - B(n,k) > 1$ and $l + B(n,k) \ge n$. Thus all of the labels from $r - B(n,k)$
through n have been used.

Let t be minimal such that all of the labels $t, t + 1, \ldots, n$ have been used; thus $t \le r -
B(n,k) \le n - B(n,k)$. If the argument in Case 1-b is repeated with $u \in Adj_s(v_s)$, rather than
$Adj_s(v_s) = \{u\}$, then this situation is seen to be impossible; hence this case cannot arise.

Case 2-d: $r - B(n,k) > 1$ and $l + B(n,k) < n$. Thus all of the labels from $r - B(n,k)$
through $l + B(n,k)$ have been used.

Define a to be minimal, and b maximal, such that all of the labels $a + 1, a + 2, \ldots, b - 2, b - 1$
have been used already, and

$$\{r - B(n,k), r - B(n,k) + 1, \ldots, l + B(n,k)\} \subseteq \{a + 1, a + 2, \ldots, b - 1\}.$$

Note that a and b have not yet been used, and that $a < r - B(n,k)$ and $b > l + B(n,k)$. Let
$P = \{a + 1, a + 2, \ldots, b - 2, b - 1\}$.

Definition 6.1.13 We define the sets $P_5$, $P_6$, $P_7$, and $P_8$ as follows.

• $P_5$ is the set of labels p in P that satisfy both of the following conditions:

1. p is more extreme than a and more extreme than b.

2. $f^{-1}(p)$ is not adjacent to any vertex that has a label either greater than $a + B(n,k)$
or less than $b - B(n,k)$.

• $P_6$ is the set of labels p in P that satisfy the following three conditions:

1. p is more extreme than a.

2. $f^{-1}(p)$ is not adjacent to any vertex that has been given a label greater than $a + B(n,k)$.

3. $f^{-1}(p)$ is adjacent to a vertex that has been given a label less than $b - B(n,k)$.

• $P_7$ is the set of labels p in P that satisfy the following three conditions:

1. p is more extreme than b.

2. $f^{-1}(p)$ is adjacent to a vertex that has been given a label greater than $a + B(n,k)$.

3. $f^{-1}(p)$ is not adjacent to any vertex that has been given a label less than $b - B(n,k)$.

• $P_8$ is the set of labels p in P that satisfy both of the following conditions:

1. $f^{-1}(p)$ is adjacent to a vertex that has been given a label greater than $a + B(n,k)$.

2. $f^{-1}(p)$ is adjacent to a vertex that has been given a label less than $b - B(n,k)$.

Lemma 6.1.14 $|P| = |P_6| + |P_7| + |P_8|$.

Proof: Each $p \in P$ was selected by OLBW as the label for some vertex $f^{-1}(p)$, rather than
a or b. The possible reasons that p was chosen instead of a or b are as follows.

1. $f^{-1}(p)$ is adjacent to both a vertex with a label more than $B(n,k)$ away from a and a
vertex with a label more than $B(n,k)$ away from b. Thus neither a nor b would have
been chosen instead of p. Note that since $a < r - B(n,k) \le n - B(n,k) < B(n,k) + 1$
(by Fact 6.1.6), the vertex with a label more than $B(n,k)$ away from a must have a
label greater than a. Similarly, observe that $b > l + B(n,k) \ge 1 + B(n,k)$, so $n - b <
n - B(n,k) - 1 < B(n,k)$, by Fact 6.1.6. Hence the vertex with a label more than $B(n,k)$
away from b must have a label less than b. Any such p is contained in $P_8$.

2. $f^{-1}(p)$ is adjacent to a vertex with a label more than $B(n,k)$ away from a, so a would
not have been chosen. Furthermore, p is more extreme than b, so b would not have been
chosen. Any such p is contained in $P_7 \cup P_8$.

3. $f^{-1}(p)$ is adjacent to a vertex with a label more than $B(n,k)$ away from b, so b would
not have been chosen. Furthermore, p is more extreme than a, so a would not have been
chosen. Any such p is contained in $P_6 \cup P_8$.

4. The only other possible reason would be that p is more extreme than both a and b. Since
$a < p < b$, this is impossible. Thus $P_5 = \emptyset$.






Thus

$$P \subseteq P_5 \cup P_6 \cup P_7 \cup P_8 = P_6 \cup P_7 \cup P_8.$$

Since $P_6 \cup P_7 \cup P_8 \subseteq P$, we have $P = P_6 \cup P_7 \cup P_8$, and thus $|P| = |P_6 \cup P_7 \cup P_8|$. It is immediate
from their definitions that $P_6$, $P_7$, and $P_8$ are disjoint sets. Therefore

$$|P| = |P_6| + |P_7| + |P_8|.$$

□

We now make one (final) case subdivision, this time depending on which of a and b is more
extreme.

Case 2-d-i: b is at least as extreme as a. Thus $a + b \ge n + 1$, so $b \ge n - a + 1$. Note
that the elements of $P_7 \cup P_8$ are the labels in P that have been assigned to vertices with edges
to vertices whose labels exceed $a + B(n,k)$. Since the maximum degree of any vertex in V is
2k, there are at most 2k distinct elements of $P_7 \cup P_8$ for each vertex with a label exceeding
$a + B(n,k)$. Hence

$$|P_7| + |P_8| \le 2k(n - a - B(n,k)).$$

The labels in $P_6$ are a subset of the set of labels in P that are strictly more extreme than a. If
$b \ge n - a + 3$, then the only labels in P more extreme than a are $n - a + 2, n - a + 3, \ldots, b - 1$.
If b equals $n - a + 2$ or $n - a + 1$ (recall that b can be no smaller than this) then no labels in
P are more extreme than a. Thus

$$|P_6| \le \max\{(b - 1) - (n - a + 2) + 1, 0\} = \max\{b - n + a - 2, 0\}.$$

Since $a + b \ge n + 1$, $b - n + a - 2 \ge -1$. Thus $b - n + a - 1 \ge 0$, so

$$|P_6| \le \max\{b - n + a - 1, 0\} = b - n + a - 1.$$

Therefore

$$\begin{aligned}
|P| &= |P_6| + |P_7| + |P_8| \\
    &\le 2k(n - a - B(n,k)) + b - n + a - 1 \\
    &= 2kn - 2ka - (2k-1)n - 1 + b - n + a - 1 \\
    &= a + b - 2ka - 2 \\
    &= b - a + 2a - 2ka - 2 \\
    &= (b - a - 1) - (2a(k-1) + 1) \\
    &< b - a - 1.
\end{aligned}$$

But by the definition of P it is obvious that $|P| = b - a - 1$. Thus we have derived a
contradiction, so this case cannot occur.

Case 2-d-ii: a is strictly more extreme than b. Thus $a + b \le n$, so $a \le n - b$. The elements
of $P_6 \cup P_8$ are the labels in P that have been assigned to vertices with edges to vertices whose
labels are less than $b - B(n,k)$. Thus

$$|P_6| + |P_8| \le 2k(b - B(n,k) - 1).$$

The labels in $P_7$ are each labels in P that are more extreme than b. If $a \le n - b - 1$, then the
only labels in P more extreme than b are $a + 1, a + 2, \ldots, n - b$. If $a = n - b$, then no labels in
P are more extreme than b. Thus

$$|P_7| \le \max\{(n - b) - (a + 1) + 1, 0\} = \max\{n - b - a, 0\} = n - b - a.$$

Therefore

$$\begin{aligned}
|P| &= |P_6| + |P_7| + |P_8| \\
    &\le 2k(b - B(n,k) - 1) + n - b - a \\
    &= 2kb - (2k-1)n - 1 - 2k + n - b - a \\
    &= (2k-1)b - a + 2n - 2kn - 2k - 1.
\end{aligned}$$

Since $k \ge 1$ and $b \le n$,

$$2(k-1)b < 2k + 2(k-1)n.$$

Thus $(2k-1)b - a + 2n - 2kn - 2k - 1 < b - a - 1$. But since $|P| = b - a - 1$, we have derived
a contradiction, so this case cannot occur.

Therefore, if we assume that $v_s$ is the first vertex that OLBW cannot assign a label to
without forcing the bandwidth of f to exceed $B(n,k)$, we inevitably find a contradiction. Hence
no such $v_s$ can exist, and OLBW always produces a function f with bandwidth no more than
$B(n,k)$.

This concludes the proof of Theorem 6.1.4. □

Corollary 6.1.15 The above result holds when G is any graph of degree no more than 2k.

Proof: In the proof above G is assumed to have bandwidth k. However, the only consequence
of this that is used is that G must then have degree less than or equal to 2k. □

It is clear that if k is large the algorithm OLBW does not guarantee an online bandwidth
that is necessarily much better than the bandwidth of $n - 1$ that is trivial to achieve. The result
in the next subsection shows, however, that the performance guarantee that OLBW offers is
close to optimal.

6.1.3 A Lower Bound 

In this subsection we give a lower bound on the bandwidth of the function output by any 
Protocol 1 online bandwidth algorithm. 

Theorem 6.1.16 For any n and k, and for any online bandwidth algorithm A, there exists a
graph G with n vertices and bandwidth k such that the function f output by A has bandwidth
greater than $\frac{k}{k+1}n - 2$. Thus no online $(\frac{k}{k+1}n - 2)$-bandwidth algorithm exists.

Before proving this theorem, we prove the following two lemmas. 

Lemma 6.1.17 Let the graph G consist of the connected components $G_1, G_2, \ldots, G_m$, with
bandwidths $k_1, k_2, \ldots, k_m$, respectively. Then the bandwidth of G is $\max\{k_1, k_2, \ldots, k_m\}$.

Proof: For each $i = 1, 2, \ldots, m$, let $f_i$ be a $k_i$-bandwidth function for $G_i$, and let $n_i$ be the
number of vertices in $G_i$. The result is witnessed by the bandwidth function f, defined by

$$f(v) = f_j(v) + \sum_{i=1}^{j-1} n_i,$$

where j is such that $G_j$ is the connected component containing the vertex v. □
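
The construction in this proof amounts to placing the components one after another along the label range. A minimal Python sketch follows, assuming each component's labeling is supplied as a dictionary from vertex to a label in $1, \ldots, n_i$; the name `combine_labelings` is hypothetical.

```python
def combine_labelings(component_labelings):
    """Shift each component's labels past those of the earlier components.

    component_labelings -- list of dicts, the i-th mapping the vertices of
                           component G_i bijectively onto 1..n_i
    Returns a single dict mapping every vertex to a label in 1..n.
    """
    f = {}
    offset = 0  # total size of the components already placed
    for f_i in component_labelings:
        for v, label in f_i.items():
            f[v] = label + offset
        offset += len(f_i)
    return f
```

Because every edge lies inside one component and all labels of that component are shifted by the same offset, label differences across edges are unchanged.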

Define a graph G to be a star if, for some vertex v, there is an edge from v to every other
vertex in G, and these are the only edges in G.

Lemma 6.1.18 A star with n vertices has bandwidth $\lfloor n/2 \rfloor$.

Proof: Let G be a star, and let $v_1, v_2, \ldots, v_n$ be some ordering of its vertices such that no
vertex has degree greater than that of $v_1$. (So $v_1$ is the "center" of the star.) Then f is a
$\lfloor n/2 \rfloor$-bandwidth function for G, where f is defined by

$$f(v_i) = \begin{cases}
\lfloor n/2 \rfloor + 1 & \text{if } i = 1 \\
i - 1 & \text{if } 2 \le i \le \lfloor n/2 \rfloor + 1 \\
i & \text{if } i \ge \lfloor n/2 \rfloor + 2
\end{cases}$$

□
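
The labeling used in this proof places the center in the middle of the label range and the leaves on either side. The Python sketch below assumes the lemma's value is $\lfloor n/2 \rfloor$ and that the piecewise definition reads as: center to $\lfloor n/2 \rfloor + 1$, the next $\lfloor n/2 \rfloor$ vertices shifted down by one, the rest left in place. The name `star_labeling` is hypothetical.

```python
def star_labeling(n):
    """Labels for v_1 (the centre), v_2, ..., v_n of a star on n vertices."""
    half = n // 2
    f = {}
    for i in range(1, n + 1):
        if i == 1:
            f[i] = half + 1        # centre goes in the middle
        elif i <= half + 1:
            f[i] = i - 1           # leaves to the left of the centre
        else:
            f[i] = i               # leaves to the right of the centre
    return f
```

For a star on 5 vertices this gives the center label 3 and leaf labels 1, 2, 4, 5, so the largest difference across an edge is 2.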

We now return to prove the theorem. 
Proof of Theorem 6.1.16: Given n, k, and any algorithm A satisfying the hypothesis, we
will define a graph G with the advertised properties.

We define G by describing the restricted adjacency lists that A is presented for each vertex.
Without loss of generality, assume that A sees the restricted adjacency lists for the vertices in
the order $v_1, v_2, \ldots, v_n$.

We partition the set of labels $\{1, 2, \ldots, n\}$ into three disjoint subsets, L, M, and R. These
are defined by

$$L = \left\{1, 2, \ldots, \left\lceil \tfrac{n}{2k+2} \right\rceil + 1\right\},$$

$$M = \{1, 2, \ldots, n\} - (L \cup R),$$

and

$$R = \left\{n - \left\lfloor \tfrac{n}{2k+2} \right\rfloor, n - \left\lfloor \tfrac{n}{2k+2} \right\rfloor + 1, \ldots, n\right\}.$$

The restricted adjacency lists given as input to A are as follows. Let $v_i$ be the vertex
currently under consideration. If there are unused labels remaining in L and unused labels
still in R, then $Adj_i(v_i) = \emptyset$. Otherwise, at least one of L and R has had all of its labels
assigned to vertices. Define X to be the first of L and R to have all of its labels used. Let x
be the most extreme label in X such that $|Adj(f^{-1}(x)) \cap \{v_1, v_2, \ldots, v_{i-1}\}| < 2k$. Then define
$Adj_i(v_i) = \{f^{-1}(x)\}$. If no such x exists (i.e. if each label in X is assigned to a vertex already
on 2k edges), then define $Adj_i(v_i) = \emptyset$.

To see that G has the desired properties, assume that $X = L$ (the case of $X = R$ is
symmetric). Define $V_L = \{v \in V : f(v) \in L\}$, $V_M = \{v \in V : f(v) \in M\}$, and $V_R = \{v \in
V : f(v) \in R\}$. Clearly $V_L$, $V_M$, and $V_R$ partition V. Note that each edge in G has exactly
one of its vertices in $V_L$. We want to show that there exists an edge connecting a vertex in $V_L$
with a vertex in $V_R$. Consider the point at which the last remaining label in L was assigned
to some vertex. At this time there was still at least one unused label in R (recall that we are
assuming that L was the first of L and R to have all of its labels used). Note that if the number
of possible edges incident to vertices in $V_L$ is greater than the number of unused labels in M,
then the as yet unlabeled vertices in $V_R$ will eventually be connected to vertices in $V_L$. Thus
the only way that an edge between vertices in $V_L$ and $V_R$ can be avoided is if the number of
unused labels in M exceeds the number of possible edges incident to vertices in $V_L$. Since G is
to have bandwidth k, its vertices may have degree as large as 2k. Thus the number of possible
edges incident to vertices in $V_L$ is $2k|V_L| \ge \frac{k}{k+1}n + 2k$. The number of unused labels in M
cannot exceed $|M| \le \frac{k}{k+1}n - 2$, which is less than the number of possible edges to vertices in
$V_L$. Thus there must exist some edge between vertices in $V_L$ and $V_R$. Hence the bandwidth of
f is at least the difference between the smallest label in R and the largest label in L, which is
greater than $\frac{k}{k+1}n - 2$.

It remains to be shown that G has bandwidth k. Each vertex in $V_L$ is adjacent to at most 2k
vertices in $V_M \cup V_R$. There are no edges between vertices in $V_L$, and no edges between vertices
in $V_M \cup V_R$. Thus G consists of |L| connected components, each of which is a star with 2k + 1
or fewer vertices. By Lemmas 6.1.17 and 6.1.18, G has bandwidth k.

As was mentioned above, the proof of the case that all of the labels in R are used before all 
of the labels in L is symmetric, and hence omitted. □ 
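
The role of the stars in this argument can be checked numerically. The following is a minimal sketch, not taken from the thesis: it assumes that a bandwidth function is a one-to-one assignment of the labels 1, ..., m to the m vertices of a graph, and that its bandwidth is the largest label difference across any edge. The helper names `bandwidth` and `star_labeling` are hypothetical.

```python
def bandwidth(edges, f):
    """Largest |f(u) - f(v)| over all edges (u, v); 0 if there are no edges."""
    return max((abs(f[u] - f[v]) for u, v in edges), default=0)


def star_labeling(num_leaves):
    """Label a star with the center in the middle of the range 1..num_leaves+1.

    This is an illustrative labeling (an assumption of this sketch), chosen so
    that no leaf is more than ceil(num_leaves / 2) labels away from the center.
    """
    center_label = num_leaves // 2 + 1
    leaf_labels = [x for x in range(1, num_leaves + 2) if x != center_label]
    f = {"center": center_label}
    for leaf, label in enumerate(leaf_labels):
        f[("leaf", leaf)] = label
    return f


def star_edges(num_leaves):
    """Edges of a star: the center joined to every leaf."""
    return [("center", ("leaf", i)) for i in range(num_leaves)]


k = 3
f = star_labeling(2 * k)                  # a star with 2k + 1 vertices
print(bandwidth(star_edges(2 * k), f))    # label spread of the widest edge
```

With the center placed in the middle, a star on at most 2k + 1 vertices never needs an edge spanning more than k labels, which is the property the proof relies on.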

Note that the difference between the result achievable by the algorithm OLBW in Theorem 6.1.4
and this lower bound is only about 2 {fl+ih n, which is less than jfc. Thus the algorithm
OLBW achieves near-optimal performance on all graphs except those with very small bandwidth.
For example, if G has bandwidth k = ~ for some constant c, then OLBW outputs a
function whose bandwidth is only an additive constant greater than the lower bound.

6.1.4 Other Online Protocols

We wish to consider other possible protocols for online algorithms. In the protocol defined in
Section 6.1.1, which we will henceforth refer to as Protocol 1, the information that the online
algorithm was given for each vertex was limited to a list of the vertices in its adjacency list that
it had already labeled. We define two new online protocols, both of which permit an algorithm
to see more of the graph before producing its output than is allowed under Protocol 1. We
then prove lower bounds on the bandwidths of the functions constructed by any algorithms
operating according to these protocols.

One logical extension to the first protocol is to permit the algorithm to see the entire 
adjacency list of the current vertex, rather than just the restricted adjacency list. Any Protocol 
1 algorithm, such as OLBW, can be readily adapted to operate according to this new protocol 
(Protocol 2) with no loss in its power; it is possible, however, that there are Protocol 2 algorithms 
that perform better than any Protocol 1 algorithm. This is suggested by the observation that 
the proof of the bound on the performance of any Protocol 1 algorithm given in Theorem 6.1.16 
does not apply to this new protocol. 

Definition 6.1.19 An algorithm A is a Protocol 2 online bandwidth algorithm if its
input/output behavior is as follows. Initially A is given as input (for some graph G) the number
of vertices n and the bandwidth k. Then, for some ordering of the vertices $v_1, v_2, \ldots, v_n$, A is
presented the adjacency lists of the vertices in that order. After the list for $v_i$ is seen, A must
output the value of $f(v_i)$ before it is shown $\mathrm{Adj}(v_{i+1})$. The decision made by A as to the value
of $f(v_i)$ is irrevocable. When all of the adjacency lists have been seen by A, it must have defined
the values of f such that f is a bandwidth function for G.

Definition 6.1.20 A Protocol 2 online bandwidth algorithm A is a Protocol 2 online
m-bandwidth algorithm if, for any n and k, for any graph G with n vertices and bandwidth
k, and for any ordering of the vertices of G, the function f defined by A is an m-bandwidth
function.
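
The difference between Protocols 1 and 2 is only in what the algorithm is shown at each step. The sketch below illustrates this under stated assumptions: the driver `run_online`, the callback signature, and the toy strategy `first_free` are all hypothetical names introduced here, and a bandwidth function is assumed to be a one-to-one assignment of the labels 1, ..., n. It is not the algorithm OLBW.

```python
def run_online(adj, order, algorithm, n, k, restricted):
    """Feed vertices to `algorithm` one at a time and collect its labels.

    restricted=True  mimics Protocol 1: only already-labeled neighbors are shown.
    restricted=False mimics Protocol 2: the whole adjacency list is shown.
    Each label is fixed as soon as it is returned (the decision is irrevocable).
    """
    f = {}
    for v in order:
        if restricted:
            shown = [u for u in adj[v] if u in f]
        else:
            shown = list(adj[v])
        label = algorithm(n, k, v, shown, dict(f))
        if not 1 <= label <= n or label in f.values():
            raise ValueError("label must be an unused value in 1..n")
        f[v] = label
    return f


def first_free(n, k, v, shown, assigned):
    """Toy strategy (an assumption of this sketch): take the smallest unused label."""
    used = set(assigned.values())
    return next(x for x in range(1, n + 1) if x not in used)


# A path on four vertices, presented in an arbitrary order.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
f = run_online(adj, [3, 1, 4, 2], first_free, n=4, k=1, restricted=True)
print(f)
```

Any strategy written for the restricted view can be run unchanged with `restricted=False` by ignoring the extra neighbors, which mirrors the remark that a Protocol 1 algorithm loses no power under Protocol 2.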

Note that this type of protocol might also be adapted to other online graph problems, such 
as graph coloring. 

Theorem 6.1.21 For any n and k, and for any Protocol 2 online bandwidth algorithm A,
there exists a graph G with n vertices and bandwidth k such that the function f output by A
has bandwidth at least $\frac{k-1}{4k} n - \frac{5}{4}$. Thus there is no Protocol 2 online $\left(\frac{k-1}{4k} n - 2\right)$-bandwidth
algorithm.



Thus for large k the lower bound is only about one quarter the size of the bound obtained 
for Protocol 1 algorithms. 

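Proof: Given n, k, and A satisfying the hypothesis, we define a graph G with the properties
described.

We define G by describing the adjacency lists of its vertices. Let $v_1, v_2, \ldots, v_n$ be the
vertices of G in the order in which their adjacency lists are shown to A. As in the proof of
Theorem 6.1.16, we partition the set of labels, $\{1, 2, \ldots, n\}$, into three sets. Define

$$L = \left\{1, 2, \ldots, \left\lceil \frac{n}{2k} \right\rceil + 2\right\},$$

$$M = \left\{\left\lceil \frac{n}{2k} \right\rceil + 3,\ \left\lceil \frac{n}{2k} \right\rceil + 4,\ \ldots,\ n - \left\lceil \frac{n}{2k} \right\rceil - 2\right\},$$

and

$$R = \left\{n - \left\lceil \frac{n}{2k} \right\rceil - 1,\ n - \left\lceil \frac{n}{2k} \right\rceil,\ \ldots,\ n\right\}.$$

Let $s = \left\lfloor \frac{2k-1}{2k} n \right\rfloor$.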

The adjacency lists given as input to A are as follows. Let $v_i$ be the current vertex. If A
has not yet used any of the labels in L, or if A has not yet used any of the labels in R, then
$\mathrm{Adj}(v_i) = \{v_t\}$, where t is minimal such that $t \ge s$ and the number of edges seen so far that are
incident to $v_t$ is less than 2k - 1 (it will be shown below that such t < n exists). The other case,
in which A has already used labels from both L and R, is handled as follows. Let $v_l$ and $v_r$ be
the first vertices to be assigned labels in L and R, respectively. Assume that l < r; the proof
in the other case is exactly analogous. We must define $\mathrm{Adj}(v_i)$ for each $i > \max\{l, r\} = r$. For
some $a, b \ge s$, $\mathrm{Adj}(v_l) = \{v_a\}$ and $\mathrm{Adj}(v_r) = \{v_b\}$. Define $\mathrm{Adj}(v_a) = \{v_n\}$, $\mathrm{Adj}(v_b) = \{v_n\}$, and
$\mathrm{Adj}(v_n) = \{v_a, v_b\}$, in each case in addition to any edges already seen. (We can do this since
a, b, and n are at least s, which will be shown below to be greater than r, and thus this won't
contradict any adjacency lists defined earlier.) For all j > r such that j is not equal to a, b, or n,
define $\mathrm{Adj}(v_j)$ to be consistent with the edges already seen (no new edges are added).

To see that G is well-defined, we must show that each adjacency list was defined only
once. First, we show that r < s. The largest that l can be is |M| + |R| + 1. Similarly,
$r \le |M| + |L| + 1$. Since |L| = |R|, we get

$$r \le |M| + |L| + 1 = \left\lfloor \frac{2k-1}{2k} n \right\rfloor - 1 < \left\lfloor \frac{2k-1}{2k} n \right\rfloor = s. \qquad (6.3)$$

We must also show that $v_n$ was not put into the adjacency list of any vertex other than $v_a$
and $v_b$; i.e., we must show that t is always less than n. Since t is defined only for vertices in
$\{v_1, \ldots, v_r\}$, it is sufficient to demonstrate that there are enough vertices in $\{v_s, v_{s+1}, \ldots, v_{n-1}\}$
to have edges to r different vertices. Each vertex in $\{v_s, \ldots, v_{n-1}\}$ is on at most 2k - 1 edges
incident to vertices in $\{v_1, v_2, \ldots, v_r\}$, so the number of different vertices that can have edges
incident to vertices in $\{v_s, v_{s+1}, \ldots, v_{n-1}\}$ is

$$|\{v_s, v_{s+1}, \ldots, v_{n-1}\}|(2k - 1) = (n - s)(2k - 1) \ge s > r,$$

by (6.3). Thus t < n, so G is well-defined.

Note that $(v_l, v_a, v_n, v_b, v_r)$ is a path of length four from $v_l$ to $v_r$. Since $f(v_r) - f(v_l) \ge |M| + 1$,
at least one of $f(v_a) - f(v_l)$, $f(v_n) - f(v_a)$, $f(v_b) - f(v_n)$, and $f(v_r) - f(v_b)$ must be $\frac{|M|+1}{4}$ or
greater. Thus the bandwidth of f is at least

$$\frac{|M| + 1}{4} = \frac{n - |L| - |R| + 1}{4} \ge \frac{n - 2\left(\frac{n}{2k} + 3\right) + 1}{4} = \frac{k-1}{4k} n - \frac{5}{4}.$$

Finally, we show that G has bandwidth k. Each vertex in $\{v_s, v_{s+1}, \ldots, v_{n-1}\} - \{v_a, v_b\}$ is
the center of a star with no more than 2k vertices. Each of these connected components has
bandwidth at most k, by Lemma 6.1.18. The remaining component of G resembles two stars,
centered at $v_a$ and $v_b$, except that $v_a$ and $v_b$ are both adjacent to $v_n$. Let $m_a$ be the number
of other vertices (in addition to $v_n$) adjacent to $v_a$, and $m_b$ be the number of other vertices (in
addition to $v_n$) adjacent to $v_b$. Both $m_a$ and $m_b$ are less than or equal to 2k - 1.

We define a k-bandwidth function f for this component as follows. Let $f(v_a) = \left\lceil \frac{m_a+1}{2} \right\rceil + 1$,
and let f assign to the other $m_a$ vertices (aside from $v_n$) that are adjacent to $v_a$ the other labels
that are less than or equal to $m_a + 1$. Set $f(v_n) = m_a + 2$. Finally, let $f(v_b) = m_a + \left\lceil \frac{m_b+1}{2} \right\rceil + 2$,
and let f assign to the other $m_b$ vertices (aside from $v_n$) that are adjacent to $v_b$ the remaining
labels $m_a + 3, m_a + 4, \ldots, m_a + \left\lceil \frac{m_b+1}{2} \right\rceil + 1, m_a + \left\lceil \frac{m_b+1}{2} \right\rceil + 3, \ldots, m_a + m_b + 3$. Since f is
a k-bandwidth function for this connected component, G has bandwidth k, by Lemma 6.1.17. □
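
The labeling of the two-star component can be checked mechanically. The sketch below assumes the labeling reads $f(v_a) = \lceil (m_a+1)/2 \rceil + 1$, $f(v_n) = m_a + 2$, and $f(v_b) = m_a + \lceil (m_b+1)/2 \rceil + 2$, with the leaves filling the remaining labels on each side; the scanned source is damaged at this point, so these formulas are a reading rather than a quotation. The function names are hypothetical.

```python
def two_star_labeling(m_a, m_b):
    """Return (f, edges) for two stars centered at 'a' and 'b' joined through 'n'.

    m_a and m_b count the leaves of each center (excluding the shared vertex 'n');
    both are assumed to be at least 1.
    """
    ceil_a = (m_a + 2) // 2          # equals ceil((m_a + 1) / 2)
    ceil_b = (m_b + 2) // 2          # equals ceil((m_b + 1) / 2)
    f = {"a": ceil_a + 1, "n": m_a + 2, "b": m_a + ceil_b + 2}
    edges = [("a", "n"), ("b", "n")]

    # Leaves of 'a' take the other labels in 1 .. m_a + 1.
    a_labels = [x for x in range(1, m_a + 2) if x != f["a"]]
    for i, label in enumerate(a_labels):
        f[("a", i)] = label
        edges.append(("a", ("a", i)))

    # Leaves of 'b' take the other labels in m_a + 3 .. m_a + m_b + 3.
    b_labels = [x for x in range(m_a + 3, m_a + m_b + 4) if x != f["b"]]
    for i, label in enumerate(b_labels):
        f[("b", i)] = label
        edges.append(("b", ("b", i)))
    return f, edges


def bandwidth(edges, f):
    """Largest label difference across any edge."""
    return max(abs(f[u] - f[v]) for u, v in edges)


k = 3
f, edges = two_star_labeling(2 * k - 1, 2 * k - 1)   # the extreme case
print(bandwidth(edges, f))
```

Running the check over all $1 \le m_a, m_b \le 2k - 1$ for small k gives a quick sanity test that no edge of the component spans more than k labels.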

A third definition of an online protocol is to allow the algorithm to see the same information
as in Protocol 2, but permit the algorithm to choose which vertex it wants to label next, rather
than allow an adversary to make the decision. Clearly any Protocol 2 algorithm can be readily
adapted to perform according to this protocol (Protocol 3) with no loss in its power. Since the
above proof of the Protocol 2 performance bound does not work for Protocol 3 algorithms, it
is possible that there are more powerful algorithms that operate under the new protocol.

Definition 6.1.22 An algorithm A is a Protocol 3 online bandwidth algorithm if its
input/output behavior is as follows. Initially A is given as input (for some graph G) the number
of vertices n and the bandwidth k. A then selects a vertex v and is shown $\mathrm{Adj}(v)$. After the
list for v is seen, A outputs the value of $f(v)$. The decision made by A as to the value of
$f(v)$ is irrevocable. Then A selects a new vertex v, and the process is repeated. When all of
the adjacency lists have been seen by A, it must have defined the values of f such that f is a
bandwidth function for G.

Definition 6.1.23 A Protocol 3 online bandwidth algorithm A is a Protocol 3 online
m-bandwidth algorithm if, for any n and k, for any graph G with n vertices and bandwidth
k, and for any ordering of the vertices of G, the function f defined by A is an m-bandwidth
function.

Like Protocols 1 and 2, this protocol can also be adapted to other graph problems. 

Theorem 6.1.24 For any k > 1, for any $\epsilon > 0$, and for any Protocol 3 online bandwidth
algorithm A, there exist n and a graph G with n vertices and bandwidth k such that the function
f output by A has bandwidth greater than $(2 - \epsilon)k$. Thus, for any $\epsilon > 0$, there is no Protocol 3
online $(2 - \epsilon)k$-bandwidth algorithm.

Proof: Given k, $\epsilon$, and A satisfying the hypothesis, we will define two graphs, $G_1$ and $G_2$.
G will be either $G_1$ or $G_2$, depending on the label A gives to the first vertex it sees. $G_1$ and $G_2$
will be shown to have the advertised properties.

Choose n to be an odd integer such that $n > \left(\max\{4, \frac{4}{\epsilon} + 2\}\right)k$. Note that this implies that
2k < n - 2k. Without loss of generality, let $v_1$ be the first vertex that A selects. A is shown
the adjacency list $\mathrm{Adj}(v_1) = \{v_2, v_3, \ldots, v_{2k+1}\}$, and must then define $f(v_1)$.

Suppose that $2k < f(v_1) \le n - 2k$. We then set $G = G_1$, where $G_1$ is defined by the
following adjacency lists. For $i = 2, 3, \ldots, 2k + 1$, let

$$\mathrm{Adj}(v_i) = \{v_1\}.$$

For $i = 2k + 2, 2k + 3, \ldots, n$, let

$$\mathrm{Adj}(v_i) = \{v_j : 2k + 2 \le j \le n,\ j \ne i,\ |i - j| \le k\}.$$

All subsequent responses to A are then made according to these adjacency lists.

Note that $G_1$ consists of two connected components. The first consists of the subgraph
induced by $\{v_1, v_2, \ldots, v_{2k+1}\}$. This subgraph is a star, since there is an edge from $v_1$ to every
other vertex in this subgraph, and these are the only edges in the subgraph. The remaining
vertices induce the other connected component; in this subgraph each vertex $v_j$ has an edge
from every other vertex in the subgraph that has an index between j - k and j + k, inclusive.
Due to the nature of this component, we will refer to it as the k-braid.

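To see that $G_1$ has bandwidth k, define $g_1$ as follows.

$$g_1(v_i) = \begin{cases} k + 1 & \text{if } i = 1 \\ i - 1 & \text{if } 2 \le i \le k + 1 \\ i & \text{if } i \ge k + 2 \end{cases}$$

$g_1$ is a k-bandwidth function for $G_1$.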

Suppose that $f(v_1) \le 2k$ or $f(v_1) > n - 2k$. We then set $G = G_2$, where $G_2$ is defined as
follows. Order the vertices according to the sequence (recall that n is odd)

$$v_n, v_{n-2}, v_{n-4}, \ldots, v_5, v_3, v_1, v_2, v_4, \ldots, v_{n-3}, v_{n-1}.$$

There is an edge in $G_2$ between every pair of vertices that are within k positions of each other
in this sequence. Note that $G_2$ has bandwidth k, since we can define a k-bandwidth function
$g_2$ by setting $g_2(v_i)$ equal to $v_i$'s position in the above sequence. All responses to A are made
according to this definition of $G_2$. Note that $\mathrm{Adj}(v_1)$ as defined earlier is consistent with $G_2$.
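
The two adversary graphs are small enough to build explicitly. The following sketch is an illustration under assumptions rather than part of the thesis: vertices are represented by their indices 1, ..., n, and the function names are hypothetical. It constructs the star plus k-braid graph and the sequence graph described above, then checks that the labelings $g_1$ and $g_2$ span at most k labels per edge and that the first vertex sees the same adjacency list in both graphs.

```python
def build_g1(n, k):
    """Star on v_1..v_{2k+1} centered at v_1, plus the k-braid on v_{2k+2}..v_n."""
    edges = {(1, i) for i in range(2, 2 * k + 2)}
    for i in range(2 * k + 2, n + 1):
        for j in range(i + 1, min(i + k, n) + 1):
            edges.add((i, j))
    return edges


def g1(i, k):
    """The labeling g_1: center in the middle of the star, braid left in place."""
    if i == 1:
        return k + 1
    return i - 1 if i <= k + 1 else i


def g2_sequence(n):
    """v_n, v_{n-2}, ..., v_3, v_1, v_2, v_4, ..., v_{n-1} (n assumed odd)."""
    return list(range(n, 0, -2)) + list(range(2, n, 2))


def build_g2(n, k):
    """Join every pair of vertices within k positions of each other in the sequence."""
    seq = g2_sequence(n)
    edges = set()
    for p in range(n):
        for q in range(p + 1, min(p + k, n - 1) + 1):
            edges.add((seq[p], seq[q]))
    return seq, edges


n, k = 11, 2
seq, e2 = build_g2(n, k)
neighbors_of_v1 = sorted(v for e in e2 if 1 in e for v in e if v != 1)
print(neighbors_of_v1)   # the adjacency list A is shown for its first vertex
```

Because $v_1$ has the same neighbors in both graphs, the adversary can postpone the choice between them until after A has committed to $f(v_1)$.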

It remains to be shown that the graphs $G_1$ and $G_2$ force f to have a bandwidth greater
than $(2 - \epsilon)k$.

Case 1: $2k < f(v_1) \le n - 2k$, so $G = G_1$. Once again, we partition the set of labels
$\{1, 2, \ldots, n\}$ into three subsets. Let M be the set containing the smallest continuous sequence
of labels that includes each of $f(v_1), \ldots, f(v_{2k+1})$. Define L to be the set of labels less
than the smallest label in M, and R to be the set of labels greater than the largest label in
M. Thus if $\mathit{little} = \min\{f(v_1), f(v_2), \ldots, f(v_{2k+1})\}$ and $\mathit{big} = \max\{f(v_1), f(v_2), \ldots, f(v_{2k+1})\}$,
then $L = \{1, 2, \ldots, \mathit{little} - 1\}$, $M = \{\mathit{little}, \mathit{little} + 1, \ldots, \mathit{big}\}$, and $R = \{\mathit{big} + 1, \mathit{big} + 2, \ldots, n\}$.

Case 1-a: At least one of L and R is equal to $\emptyset$. Assume that $L = \emptyset$. (The proof of the
other case is analogous.) By the definitions of M and $G_1$, there is some $i \le 2k + 1$ such that $v_i$
is adjacent to $v_1$ and $f(v_i) = 1$. Since $f(v_1) > 2k$, the bandwidth of f is at least 2k.

Case 1-b: $L \ne \emptyset$ and $R \ne \emptyset$. If there exist vertices x and y with $f(x) \in L$ and $f(y) \in R$
and such that there is an edge (x, y), then the bandwidth of f is at least $|M| + 1 \ge 2k + 2$.
Assume no such vertices exist.

Define $V_L$, $V_M$, and $V_R$ to be the sets of vertices with labels in L, M, and R, respectively.

Lemma 6.1.25 There are at least k vertices not in $V_L$ that are adjacent to vertices in $V_L$.
There are at least k vertices not in $V_R$ that are adjacent to vertices in $V_R$.

Proof: We prove the result for $V_L$; the proof for $V_R$ is similar. By the definitions of L and
R, all of the vertices in $V_L$ and $V_R$ are in the k-braid. For $i = 1, 2, \ldots, |L|$, define $v_{l_i}$ to be the
ith-lowest indexed vertex in $V_L$; thus $l_1 < l_2 < \cdots < l_{|L|}$. Similarly, let $v_{r_j}$ be the jth-lowest
indexed vertex in $V_R$, for $j = 1, 2, \ldots, |R|$; hence $r_1 < r_2 < \cdots < r_{|R|}$. Think of the k-braid as
being a chain of vertices with $v_{2k+2}$ on the left end of the chain and $v_n$ on the right end, with
every vertex having an edge to each vertex within distance k of it. We will find a lower bound
on the total number of distinct vertices in the k-braid that have edges to vertices in $V_L$. There
are four cases to consider.

1. Suppose that $l_1 < r_1$ and $r_{|R|} > l_{|L|}$. Thus $v_{l_1}$ and $v_{r_{|R|}}$ are the leftmost and rightmost
vertices, respectively, in the k-braid that are in $V_L \cup V_R$. Note that by the assumption
above there are no edges between vertices in $V_L$ and $V_R$, so $l_{|L|} < n - k - 1$. The vertex
$v_{l_1}$ is adjacent to at least k vertices. If $l_2 \le l_1 + k$, then $v_{l_2}$ is adjacent to at least two
vertices that are not adjacent to $v_{l_1}$: $v_{l_1}$ itself and $v_{l_2+k}$. If $l_2 > l_1 + k$, then $v_{l_2}$ is adjacent
to at least k vertices that are not adjacent to $v_{l_1}$: each of $v_{l_2+1}, v_{l_2+2}, \ldots, v_{l_2+k}$. For
each $i = 3, 4, \ldots, |L|$ the vertex $v_{l_i}$ is adjacent to at least one vertex ($v_{l_i+k}$) that none of
$v_{l_1}, v_{l_2}, \ldots, v_{l_{i-1}}$ is adjacent to (since $l_{|L|} < n - k - 1$ we don't encounter the problem of
running into the vertices at the right end of the k-braid that have degree less than 2k).
Since by hypothesis $k \ge 2$, there are at least k + 2 + |L| - 2 = |L| + k vertices adjacent
to vertices in $V_L$.

2. Suppose that $r_1 < l_1$ and $l_{|L|} > r_{|R|}$. Now $v_{r_1}$ and $v_{l_{|L|}}$ are the leftmost and rightmost
vertices in the k-braid that are in $V_L \cup V_R$. We prove the same bound as in Part 1 by an
analogous proof.
Since there are no edges between vertices in $V_L$ and $V_R$, note that $l_1 > k + 1$. The vertex
$v_{l_{|L|}}$ is adjacent to at least k vertices. If $l_{|L|-1} \ge l_{|L|} - k$, then $v_{l_{|L|-1}}$ is adjacent to
at least two vertices that are not adjacent to $v_{l_{|L|}}$: $v_{l_{|L|}}$ itself and $v_{l_{|L|-1}-k}$. If
$l_{|L|-1} < l_{|L|} - k$, then $v_{l_{|L|-1}}$ is adjacent to at least k vertices that are not adjacent to $v_{l_{|L|}}$: each
of $v_{l_{|L|-1}-1}, v_{l_{|L|-1}-2}, \ldots, v_{l_{|L|-1}-k}$. For each $i = |L| - 2, |L| - 3, \ldots, 1$ the vertex $v_{l_i}$ is
adjacent to at least one vertex ($v_{l_i-k}$) that none of $v_{l_{|L|}}, v_{l_{|L|-1}}, \ldots, v_{l_{i+1}}$ is adjacent to
(since $l_1 > k + 1$ we don't encounter the problem of running into the vertices at the left end
of the k-braid that have degree less than 2k). Thus there are at least k + 2 + |L| - 2 = |L| + k
vertices adjacent to vertices in $V_L$.

3. Suppose that $l_1 < r_1$ and $l_{|L|} > r_{|R|}$, so both the leftmost and rightmost vertices in $V_L \cup V_R$
are in $V_L$. Let r be the rightmost vertex in $V_R$, and let $V_{L_1}$ and $V_{L_2}$ be the sets of vertices
in $V_L$ to the left and right, respectively, of r. An argument similar to the one given in
part 1 above shows that there are at least $|V_{L_1}| + k$ vertices adjacent to vertices in $V_{L_1}$.

Since the vertex $r \in V_R$ lies between the vertices of $V_{L_1}$ and $V_{L_2}$, and since there are
no edges between vertices in $V_L$ and $V_R$, there are no vertices adjacent to vertices in
both $V_{L_1}$ and $V_{L_2}$. An argument similar to the one given in part 2 above shows that
there are at least $|V_{L_2}| + k$ vertices adjacent to vertices in $V_{L_2}$. Thus there are at least
$|V_{L_1}| + k + |V_{L_2}| + k = |L| + 2k$ vertices adjacent to vertices in $V_L$.

4. Suppose that $r_1 < l_1$ and $r_{|R|} > l_{|L|}$. Thus the leftmost and rightmost vertices in $V_L \cup V_R$
are in $V_R$. There are no edges between vertices in $V_L$ and $V_R$, so $l_1 > k + 1$ and
$l_{|L|} < n - k - 1$. The vertex $v_{l_1}$ is adjacent to at least 2k vertices. For each $i = 2, 3, \ldots, |L|$
the vertex $v_{l_i}$ is adjacent to at least one vertex ($v_{l_i+k}$) that none of $v_{l_1}, v_{l_2}, \ldots, v_{l_{i-1}}$ is
adjacent to (since $l_{|L|} < n - k - 1$ we don't encounter the problem of running into the
vertices at the right end of the k-braid that have degree less than 2k). Thus there are at
least 2k + |L| - 1 vertices adjacent to vertices in $V_L$.

Thus there are always at least |L| + k (distinct) vertices that are adjacent to vertices in $V_L$.
At least k of these are not in $V_L$, proving the lemma. A similar argument shows the same result
for $V_R$.

$V_M$ contains 2k + 1 vertices not in the k-braid. Since there are no edges between vertices in
$V_L$ and $V_R$, all of the k or more vertices not in $V_L$ that are adjacent to vertices in $V_L$, and all
of the k or more vertices not in $V_R$ that are adjacent to vertices in $V_R$, must be in $V_M$.

Suppose that there is no vertex in $V_M$ that is adjacent to vertices in both $V_L$ and $V_R$. Then
the size of $V_M$, and hence M, is at least 4k + 1. But by the definition of M, if $m_1$ and $m_2$
are the vertices in $V_M$ with the smallest and largest, respectively, labels from M, then both
$m_1$ and $m_2$ are in the star. Thus there is a path of length two or less from $m_1$ to $m_2$, so
some edge on this path spans at least $\frac{f(m_2) - f(m_1)}{2} \ge \frac{4k}{2} = 2k$ labels. Thus the bandwidth of f is at least 2k.

Alternatively, suppose that there are h > 0 vertices in $V_M$ that are adjacent to vertices in
both $V_L$ and $V_R$ ("shared" vertices). Adjusting for the shared vertices, we get

$$|V_M| \ge 2k + 1 + 2k - h = 4k + 1 - h.$$

But at least one of the h shared vertices must have a label at least $\frac{h}{2}$ away from the average
value of the labels in M. Thus this vertex has a label at least $\frac{|M|}{2} + \frac{h}{2}$ away from the label
of some vertex in $V_L \cup V_R$. Hence the bandwidth of f is at least

$$\frac{|M|}{2} + \frac{h}{2} \ge \frac{4k + 1 - h}{2} + \frac{h}{2} = 2k + \frac{1}{2}.$$

Case 2: $f(v_1) \le 2k$ or $f(v_1) > n - 2k$, so $G = G_2$. We demonstrate a lower bound on the
bandwidth of f for the case when $f(v_1) \le 2k$. The other case is symmetric. Since $n > \left(\frac{4}{\epsilon} + 2\right)k$,
$\frac{2}{\epsilon} < \frac{n}{2k} - 1$. Choose an integer d such that $\frac{2}{\epsilon} < d < \frac{n}{2k}$. There are 2dk vertices with path
length d or less from $v_1$ (not including $v_1$ itself). Thus at least one such vertex u must have a
label of 2dk or greater. Since $f(v_1) \le 2k$, there is a path of length d or less from $v_1$ to u, and
$f(u) - f(v_1) \ge 2dk - 2k$. Thus the bandwidth of f is at least

$$\frac{2dk - 2k}{d} = 2k - \frac{2k}{d} = \left(2 - \frac{2}{d}\right)k > (2 - \epsilon)k.$$

This concludes the proof of Theorem 6.1.24. □

6.1.5 Discussion

No algorithm that operates according to Protocol 1 or 2 will always output a function with 
bandwidth less than an appreciable fraction of n. Because of the weaker lower bound for 
Protocol 3 algorithms, it is possible that good algorithms may exist for this protocol. No such 
algorithm has yet been found, however. 

There are several areas ripe for future research. The performance bounds for all three 
protocols could be tightened. In particular, the best algorithm known under Protocols 2 and 3 
is OLBW. It seems likely that there are more powerful algorithms that are specifically designed 
to exploit the additional information that is available under these protocols. Also, the algorithm 
OLBW requires only modest computational resources. Algorithms that take better advantage 
of the unlimited time and space permitted by all three of these protocols might yield better 
results. It would also be desirable to find good algorithms that don't need to know the actual 
graph bandwidth at the outset. 

6.2 Online Algorithms for Vertex Subset Problems 

The problems that we consider next are a special class of vertex labeling problems that we call
vertex subset problems. In these problems the objective is to construct a subset of the vertex
set of G that satisfies certain constraints. Depending on the particular problem, the goal is for
this subset to be either as large or as small as possible, subject to the constraints. Each vertex
is either put into the set or kept out of it; thus the only labels to be assigned to the vertices
are IN and OUT.

The three online protocols defined for the bandwidth problem are readily adapted to the
case of vertex subset problems. Let G = (V, E) be a simple finite undirected graph with vertex
set V and edge set E. A vertex subset problem P for G consists of a constraint function
$g : 2^V \to \{0, 1\}$ and a bit indicating whether the goal is to construct a maximum- or
minimum-size subset of V. A solution to the problem P is a subset $V'$ of V such that $g(V') = 1$.
The cardinality of $V'$, along with the indicator as to whether large or small sets are desirable,
determines how good a solution $V'$ is.

For any vertex subset problem P we make the following definitions.

Definition 6.2.1 A Protocol 1 algorithm for P is any algorithm A with input/output behavior
as follows. Let G = (V, E) be any simple finite undirected graph, and let $v_1, v_2, \ldots, v_n$ be any
ordering of the vertices in V. At each stage $s = 1, 2, \ldots, n$, A behaves as follows:

1. A is shown the restricted adjacency list $\mathrm{Adj}_r(v_s)$ of $v_s$.

2. A assigns to the vertex $v_s$ either the label IN or the label OUT.

Let $V' \subseteq V$ be the set of vertices assigned the label IN by A. Then $V'$ is a solution of P.

Definition 6.2.2 A Protocol 2 algorithm for P is any algorithm A with input/output behavior
as follows. Let G = (V, E) be any simple finite undirected graph, and let $v_1, v_2, \ldots, v_n$ be any
ordering of the vertices in V. At each stage $s = 1, 2, \ldots, n$, A behaves as follows:

1. A is shown the adjacency list $\mathrm{Adj}(v_s)$ of $v_s$.

2. A assigns to the vertex $v_s$ either the label IN or the label OUT.

Let $V' \subseteq V$ be the set of vertices assigned the label IN by A. Then $V'$ is a solution of P.

Definition 6.2.3 A Protocol 3 algorithm for P is any algorithm A with input/output behavior
as follows. Let G = (V, E) be any simple finite undirected graph. At each stage $s = 1, 2, \ldots, n$,
A behaves as follows:

1. A selects a vertex $v \in V$ that it has not yet labeled.

2. A is shown the adjacency list $\mathrm{Adj}(v)$ of v.

3. A assigns to the vertex v either the label IN or the label OUT.

Let $V' \subseteq V$ be the set of vertices assigned the label IN by A. Then $V'$ is a solution of P.

As was the case for the bandwidth problem, a Protocol 1 algorithm must assign a label to
$v_s$ having seen only the adjacency list for $v_s$ restricted to those vertices with index less than s
(i.e., those already labeled). A Protocol 2 algorithm is permitted to see all of the edges incident
to $v_s$ before assigning a label to $v_s$, and a Protocol 3 algorithm is allowed, in addition, to choose
the order in which it labels the vertices. All of this is exactly analogous to the definitions for
the graph bandwidth problem.






6.2.1 The Online Independent Set Problem 

The first vertex subset problem that we consider is the problem of finding a large independent 
set of a graph. 

Definition 6.2.4 For any graph G = (V, E), a subset $V'$ of V is an independent set of G if
there are no edges between vertices in $V'$.

The general problem of finding the maximum-size independent set of a graph has been shown
to be NP-complete [28, 43]. Here we are interested in online algorithms to find large, although
not necessarily maximum-size, independent sets. Clearly this is a vertex subset problem; the
goal is to construct a large subset $V'$ of V, subject to the constraint that none of the vertices
in $V'$ are adjacent. (Thus the constraint function g maps $V'$ to 1 if there are no edges between
vertices in $V'$, and 0 otherwise.)
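
As a concrete illustration, the constraint function for the independent set problem can be written in a few lines. This is a minimal sketch: the adjacency-dictionary representation and the name `is_independent_set` are assumptions made for the example, not notation from the thesis.

```python
def is_independent_set(adj, subset):
    """Constraint function g: return 1 if no two chosen vertices are adjacent, else 0.

    `adj` maps each vertex to the collection of its neighbors.
    """
    chosen = set(subset)
    for v in chosen:
        for u in adj[v]:
            if u in chosen:
                return 0
    return 1


# A path on three vertices: 1 - 2 - 3.
path = {1: [2], 2: [1, 3], 3: [2]}
print(is_independent_set(path, {1, 3}), is_independent_set(path, {1, 2}))
```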

In the remainder of this subsection we use k to denote the size of the maximum-size
independent set of a graph and n to denote the number of vertices in the graph.

Protocol 1 algorithms are not powerful enough to find large independent sets. As the 
following theorem shows, algorithms operating according to this protocol cannot be guaranteed 
to do better than even the most naive algorithm. 

Theorem 6.2.5 There is no Protocol 1 independent set algorithm that, for any graph G, always
outputs an independent set of size greater than $\frac{1}{n-1} k$.

Proof: The result is easily proved by defining, for any Protocol 1 algorithm A, a graph
G with n vertices that forces A to output a singleton set. Suppose A gives $v_1$ the label IN.
Then define G to be the n-vertex graph in which there is an edge from $v_1$ to every other vertex
in the graph. There are no other edges. Thus $\{v_2, v_3, \ldots, v_n\}$ is an independent set of size
n - 1. Since A cannot assign the label IN to any vertices other than $v_1$ without violating the
constraint function, the result holds.

Suppose that A assigns $v_1$ the label OUT. Until A has assigned some vertex the label IN,
let all of the restricted adjacency lists that it sees be empty. Let $v_i$ be the first vertex assigned
the label IN by A. Then define G to be the n-vertex graph with an edge from $v_i$ to each of
$v_{i+1}, v_{i+2}, \ldots, v_n$. Thus A cannot assign the label IN to any other vertices; since $V - \{v_i\}$ is an
independent set, the result follows. (Note that if A never assigns the label IN to any vertex,
the set it outputs is empty and V itself is an independent set.)



□ 
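
The adversary in this proof is simple enough to simulate. The sketch below is an illustration under assumptions, not part of the thesis: a Protocol 1 algorithm is modeled as a callable that receives the stage number and the restricted adjacency list and returns "IN" or "OUT", and the names `protocol1_adversary`, `make_greedy`, and `make_delayed` are hypothetical.

```python
def protocol1_adversary(algorithm, n):
    """Build the graph online: empty lists until the first IN, then a star around it.

    Returns the labels chosen by `algorithm` and the edge set of the resulting graph.
    """
    labels, edges, first_in = {}, set(), None
    for s in range(1, n + 1):
        if first_in is None:
            restricted = []                 # nothing to show yet
        else:
            restricted = [first_in]         # every later vertex touches the IN vertex
            edges.add((first_in, s))
        labels[s] = algorithm(s, restricted)
        if labels[s] == "IN" and first_in is None:
            first_in = s
    return labels, edges


def make_greedy():
    """Sample algorithm: take a vertex unless it is adjacent to one already taken."""
    chosen = set()

    def algorithm(s, restricted):
        if any(u in chosen for u in restricted):
            return "OUT"
        chosen.add(s)
        return "IN"

    return algorithm


def make_delayed(delay):
    """Sample algorithm: reject the first `delay` vertices, then behave greedily."""
    greedy = make_greedy()

    def algorithm(s, restricted):
        return "OUT" if s <= delay else greedy(s, restricted)

    return algorithm


labels, edges = protocol1_adversary(make_delayed(3), 8)
print([v for v in labels if labels[v] == "IN"], sorted(edges))
```

In both sample runs the algorithm ends with a single vertex labeled IN, while the remaining n - 1 vertices have no edges among themselves.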



Since an algorithm that just assigns the first vertex it sees to the independent set is
guaranteed a set of size at least $1 \ge \frac{1}{n} k$, this bound makes it clear that Protocol 1 algorithms are too
restricted to offer satisfactory performance guarantees for the independent set problem. The
problem is more interesting when we consider algorithms that operate under Protocols 2 and
3.

Theorem 6.2.6 There is no Protocol 2 independent set algorithm that, for any graph G, always
outputs an independent set of size at least $\frac{1}{\sqrt{n} - 2} k$.

Proof: Let n > 4 be any perfect square. For any independent set algorithm A that operates
under Protocol 2, we define an n-vertex graph G = (V, E) containing an independent set of size
k for which A outputs an independent set of cardinality less than $\frac{1}{\sqrt{n} - 2} k$.
Define $V = U \cup V_1 \cup V_2 \cup \cdots \cup V_{\sqrt{n}}$, where

$$U = \{u_1, u_2, \ldots, u_{\sqrt{n}}\}$$

and $V_i$ is a set of $\sqrt{n} - 1$ vertices, disjoint from U and from every other set $V_j$,
for each $i = 1, 2, \ldots, \sqrt{n}$. The first $\sqrt{n}$ vertices that A will be shown are those in U. For each
$u_i \in U$, A is told that the vertices adjacent to $u_i$ are exactly those in $V_i$. A then gives $u_i$ either
the label IN or OUT. Let $U_{in}$ and $U_{out}$ be the sets of vertices in U to which A has assigned the
labels IN and OUT, respectively. At this point we define the rest of the edges in E.

For each i and j such that $u_i, u_j \in U_{out}$, there is an edge between each pair of vertices in
$V_i \cup V_j$. Thus the set

$$\bigcup_{u_i \in U_{out}} V_i$$

induces a complete subgraph of G with $|U_{out}|(\sqrt{n} - 1)$ vertices. We denote this complete
subgraph by K.

For each i such that $u_i \in U_{in}$, there are no edges between any pair of vertices in $V_i$, nor are
there edges to any vertices in any other set $V_j$.

Note that each $u_i \in U_{out}$ has already been labeled OUT by A. Furthermore, since K
is a complete subgraph, at most one vertex in K can be labeled IN, by the definition of an
independent set. For each i such that $u_i \in U_{in}$, none of the vertices in $V_i$ can be assigned the
label IN, since they are each adjacent to $u_i$, which has already been labeled IN. Thus A can
assign at most $|U_{in}| + 1$ vertices of V the label IN. Let $k_A$ be the size of the independent set
that A will output for G. Then $k_A \le |U_{in}| + 1$.
The set

$$U_{out} \cup \bigcup_{u_i \in U_{in}} V_i$$

is an independent set for G of cardinality $k = |U_{out}| + |U_{in}|(\sqrt{n} - 1)$. Thus

$$\frac{k}{k_A} \ge \frac{|U_{out}| + |U_{in}|(\sqrt{n} - 1)}{|U_{in}| + 1}.$$

Since $|U_{out}| = \sqrt{n} - |U_{in}|$,

$$\frac{k}{k_A} \ge \frac{\sqrt{n} + |U_{in}|(\sqrt{n} - 2)}{|U_{in}| + 1} = \frac{(|U_{in}| + 1)(\sqrt{n} - 2) + 2}{|U_{in}| + 1} > \sqrt{n} - 2.$$

Thus $k_A < \frac{k}{\sqrt{n} - 2}$.
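
The counting in this proof can be checked for every possible split of U. The sketch below is an illustrative check only: it assumes the quantities read $k = |U_{out}| + |U_{in}|(\sqrt{n} - 1)$ and $k_A \le |U_{in}| + 1$, and the function name `protocol2_counts` is hypothetical.

```python
from math import isqrt


def protocol2_counts(n, num_in):
    """Return (k, k_A_upper) for the adversary graph when A labels `num_in` of U as IN.

    n must be a perfect square; U has sqrt(n) vertices and each V_i has sqrt(n) - 1.
    """
    root = isqrt(n)
    if root * root != n:
        raise ValueError("n must be a perfect square")
    num_out = root - num_in
    k = num_out + num_in * (root - 1)    # size of the independent set U_out plus the V_i of U_in
    k_a_upper = num_in + 1               # U_in plus at most one vertex of the clique K
    return k, k_a_upper


n = 25
for num_in in range(isqrt(n) + 1):
    k, k_a_upper = protocol2_counts(n, num_in)
    # Integer form of k / k_A > sqrt(n) - 2, avoiding floating point.
    print(num_in, k, k_a_upper, k > (isqrt(n) - 2) * k_a_upper)
```

The gap $k - (\sqrt{n} - 2)(|U_{in}| + 1)$ works out to exactly 2 for every split, which is why the strict inequality holds no matter how A labels U.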



A slightly weaker bound applies to algorithms that operate under Protocol 3. 

Theorem 6.2.7 There is no Protocol 3 independent set algorithm that, for any graph G, always
outputs an independent set of size at least $\frac{2}{\sqrt{n}} k$, where n is the number of vertices in G and k
is the size of the maximum independent set in G.

Proof: Let c be a positive integer. Given a Protocol 3 algorithm A and the value c we
define a graph G for which A outputs an independent set of cardinality less than $\frac{2}{\sqrt{n}} k$, where
n is the number of vertices in G and k is the size of the largest independent set in G. We
can define such a graph with an arbitrarily large number of vertices by choosing c arbitrarily
large. Alternatively, we can construct an arbitrarily large graph with this property by using
the technique described below to define a subgraph with the property. The procedure can then
be iterated as many times as necessary, using different vertices each time, to produce a graph
as large as desired.

Let $k_A$ denote the size of the independent set output by A. We first define G and prove
that $\frac{k_A}{k} \le \frac{1}{c}$. The graph G is defined according to the order in which A chooses to label the
vertices and the labels A assigns to those vertices. Let $v_1$ be the first vertex that A selects to
label. Let its adjacency list consist of c new vertices. (In this proof not all of the vertices will
be given explicit names. A "new" vertex is one that A has not yet queried, and that has not
appeared in any adjacency list already shown to A.) If A assigns the label IN to $v_1$, then the
definition of G is complete; it is the (c + 1)-vertex graph defined by the adjacency list of $v_1$. A
cannot assign the label IN to any of the vertices adjacent to $v_1$, so $k_A = 1$. Since the size k of
the largest independent set for G is clearly c, $\frac{k_A}{k} = \frac{1}{c}$.

If A gives $v_1$ the label OUT, then we must describe how subsequent queries are handled,
and thus how G is defined. For $i = 2, 3, \ldots, 2c - 1$, if A has assigned the label OUT to each of
the first i - 1 vertices it has queried, then the ith query is responded to as follows. Let $v_i$ denote
the vertex that A chooses as the ith vertex to label. The adjacency list for $v_i$ consists of 2c - 1
new vertices, as well as any other vertices that must be included (i.e., any previously-labeled
vertices in whose adjacency list $v_i$ appeared). If A assigns the label OUT, then go on to the
next query. If A gives $v_i$ the label IN, then we complete the definition of G by adding a new
vertex w and a number of new edges. Let S be the set of vertices consisting of w and each
vertex that has already appeared in some adjacency list, has not been queried (labeled), and is
not adjacent to $v_i$. Add edges between every pair of vertices in S, so that S induces a complete
subgraph of G. This completes the definition of G. All responses to subsequent queries by
A are, of course, based on this definition of G. Since S induces a complete subgraph, A can
assign at most one vertex in S the label IN. Since $v_i$ is the only vertex A has already labeled
IN, $k_A \le 2$. However, the set consisting of w and the 2c - 1 new vertices that appeared in the
adjacency list for $v_i$ forms an independent set for G, so $k \ge 2c$. Thus $\frac{k_A}{k} \le \frac{2}{2c} = \frac{1}{c}$.

It remains to consider the case in which A assigns the label OUT to each of the first 2c - 1
vertices that it queries. Suppose this is the case, and that thus far G contains the vertices and
edges defined in the responses to the first 2c - 1 queries of A. The remainder of G is defined as
follows. One new vertex, x, is added. Let K be the set of vertices consisting of x and all vertices
in G that were not among the first 2c - 1 queried and labeled by A. (Thus the only vertices
in G that are not in K are the first 2c - 1 vertices that A labeled.) Add an edge between each
pair of vertices in K, so that K induces a complete subgraph of G. Thus A can assign the label
IN to at most one vertex in K. Since A has already assigned the 2c - 1 vertices not in K the
label OUT, $k_A \le 1$.

Let Q be the set of the first 2c - 1 vertices queried by A. In order to construct a lower
bound on k, first note that the subgraph induced by Q is a forest. This can be seen as follows.
For each j = 1, 2, ..., 2c - 1, let $v_j \in Q$ denote the jth vertex queried by A. Suppose there is
an edge between some $v_h \in Q$ and $v_j$ with h < j. For any i between h and j, at the time of
the ith query $v_j$ is no longer a new vertex. Since $v_j$ has not previously been queried, there is
no way that it could be in the adjacency list of $v_i$. Thus $v_j$ is adjacent to at most one vertex
$v_h$ with h < j. Hence there are no cycles in the subgraph of G induced by Q, so the subgraph
is a forest.

Lemma 6.2.8 Any forest F with m vertices has an independent set of size at least $m/2$.

Proof of Lemma: Consider any connected component C of F. Choose an arbitrary vertex
in C as the root, and express C as a rooted tree T. Define EVEN to be the set of vertices
in C at even levels of T, and ODD to be the set of vertices in C at odd levels of T. Take
whichever of these two sets is larger, and add its contents to the independent set. Repeat for
each connected component of F. Clearly the result is an independent set that contains at least
half of the vertices in each component of F. This proves the lemma. □
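
The construction in the proof of Lemma 6.2.8 can be sketched in a few lines of Python. This is
only an illustration under assumed conventions: the forest is given as a dictionary mapping each
vertex to its neighbours, and the function name is hypothetical rather than taken from the text.

```python
from collections import deque

def forest_independent_set(adj):
    """Return an independent set with at least half the vertices of a forest.

    adj maps each vertex to an iterable of its neighbours (assumed format).
    """
    chosen = set()
    seen = set()
    for root in adj:
        if root in seen:
            continue
        # Breadth-first search from an arbitrary root of this component,
        # splitting its vertices into even and odd levels of the rooted tree.
        levels = {0: [], 1: []}
        seen.add(root)
        queue = deque([(root, 0)])
        while queue:
            v, depth = queue.popleft()
            levels[depth % 2].append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append((w, depth + 1))
        # Keep whichever of EVEN and ODD is larger for this component.
        chosen.update(max(levels[0], levels[1], key=len))
    return chosen
```

Every edge of a tree joins two consecutive levels, so neither EVEN nor ODD contains an edge,
which is why taking the larger class per component is safe.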

Thus there is an independent set for the subgraph of G induced by Q of size at least $|Q|/2$. If
we add the vertex x to this set we get an independent set for G of size $|Q|/2 + 1$, so

$$k \ge \frac{|Q|}{2} + 1 = \frac{2c-1}{2} + 1 = c + \frac{1}{2} > c.$$

Thus $k_A/k < 1/c$.

It remains only to express the upper bound on $k_A/k$ in terms of n and k. We first establish an
upper bound on n, the number of vertices in G. There are c + 1 vertices introduced in response
to the first query. In each of the (at most) 2c - 2 other queries needed before G is completely
defined, 2c - 1 new vertices are introduced in the adjacency list. In addition, the queried vertex
itself could be a new vertex. Finally, the vertex x (or w) is added to G. Thus an upper bound
on the number of vertices in G is

$$n \le (c+1) + (2c-2)(2c-1+1) + 1 = 4c^2 - 3c + 2 \le 4c^2.$$

Therefore $c \ge \sqrt{n}/2$, so $1/c \le 2/\sqrt{n}$. Consequently $k_A/k \le 2/\sqrt{n}$, so $k_A \le (2/\sqrt{n})\,k$. □

There is no known polynomial-time algorithm that is guaranteed to always output an
independent set of size within a constant factor of optimal. In fact, it has been shown that the
existence of any such algorithm would imply the existence of a polynomial-time algorithm that
always outputs an independent set of size within a factor of ε times the optimal, for any ε > 0
[28]. This provides strong evidence that no constant factor polynomial-time approximation
algorithm exists. The results of this section show that, even if the restriction on running time
is removed, no good approximation algorithm exists that relies exclusively on local heuristics.

6.2.2 The Online Vertex Cover Problem 

Another vertex subset problem is the problem of finding a small vertex cover of a graph. 

Definition 6.2.9 For any graph G = (V, E), a subset V' of V is a vertex cover for G if, for
every edge (u, v) in E, at least one of u and v is in V'.

The general problem of finding the minimum-size vertex cover of a graph has been shown
to be NP-complete [28, 43]. We are interested in online algorithms to find small, although not
necessarily minimum-size, vertex covers. The vertex cover problem can be seen to be a vertex
subset problem as follows. The object is to construct a small subset V' of V, subject to the
constraint that each edge in E is incident to at least one vertex in V'. (Thus the constraint
function g maps V' to 1 if this condition is met.)

In the remainder of this subsection k is used to denote the cardinality of the minimum-size
vertex cover of a graph G.

As was the case for the independent set problem, the only performance guarantees for
Protocol 1 vertex cover algorithms are extremely weak. It can easily be shown that no Protocol 1
algorithm for the vertex cover problem always outputs a cover of size smaller than (n - 1)k.
The proof of this is omitted.

Algorithm VC

1. Initialize the set PUT-IN-COVER = ∅.

2. For each vertex v to be labeled do:

   (i) If v ∈ PUT-IN-COVER then assign v the label IN.

   (ii) Else, if there is some w ∈ Adj(v) such that w has not
        yet been labeled and w ∉ PUT-IN-COVER then assign
        v the label IN and put w into PUT-IN-COVER.

   (iii) Else assign v the label OUT.

Figure 6.2: Protocol 2 algorithm to find vertex cover of size at most 2k

Much better results can be obtained when an algorithm is allowed to operate under
Protocol 2. The following theorem gives an algorithm that always outputs a cover of cardinality at
most twice the size of the optimal (smallest) cover.

Theorem 6.2.10 There is a Protocol 2 (and hence Protocol 3 as well) vertex cover algorithm 
that for any graph G always outputs a vertex cover of size at most 2k. 

Proof: Let G = (V, E) be a graph. The Protocol 2 algorithm VC (Figure 6.2) implements the
well-known approximation algorithm that constructs a vertex cover consisting of the vertices
incident to edges in a maximal matching of G [28].

Let M ⊆ E be the set of edges (v, w) such that v is assigned the label IN and w is put into
PUT-IN-COVER in the same execution of Step 2(ii) of the algorithm VC. We show that M is
a maximal matching for G.

Since all vertices in PUT-IN-COVER are labeled IN, Step 2(ii) of VC is equivalent to
assigning the label IN to both v and w. Since v and w are adjacent, this is tantamount to
adding the edge (v, w) to M. Once such a Step 2(ii) has been executed, neither v nor w will
satisfy the conditions of that clause in any future iteration. Thus neither will be incident to
any edge added in a later iteration. Consequently no vertex lies on more than one edge in M,
so M is a matching.

To see that M is maximal, suppose that (r, s) is an edge not in M such that neither r nor
s is incident to any edge in M. Without loss of generality, assume that r was labeled before s.
Consider the iteration of Step 2 in which r was labeled. At that time, since neither r nor s is
incident to any edge in M, neither would have been in PUT-IN-COVER. Thus the condition of
Step 2(i) would not have been satisfied, but the condition of Step 2(ii) would have been. Thus
either both r and s would have been given the label IN, and (r, s) would be in M, or else some
other edge (r, t) would be added to M. This contradicts our assumption, so no such edge (r, s)
can exist. Hence M is a maximal matching.

Clearly the vertices that are assigned the label IN are exactly those that are incident to
an edge in the maximal matching M; let $V_{IN}$ be the set of such vertices. Since M is maximal,
there is no edge with both its vertices in $V - V_{IN}$, and thus $V_{IN}$ is a vertex cover for G. Since
any vertex cover must include at least one vertex incident to each edge in M, and since no two
edges in M have a vertex in common, $k \ge |M| = |V_{IN}|/2$. Thus $|V_{IN}| \le 2k$. (These properties of
the maximal matching approach to finding approximate solutions to the minimum-size vertex
cover problem appear in [28].) □
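
For concreteness, Algorithm VC of Figure 6.2 can be sketched in Python. The interaction model
used here is an assumption made for illustration: vertices arrive in a fixed order, each with its
full adjacency list, and a label cannot be changed once assigned. The function name and the
dictionary-based graph format are hypothetical.

```python
def online_vertex_cover(order, adj):
    """Label vertices IN/OUT in arrival order, following Algorithm VC.

    order: sequence of vertices in the order they are presented.
    adj:   dict mapping each vertex to a list of its neighbours.
    """
    put_in_cover = set()   # vertices already committed to the cover
    label = {}             # irrevocable labels assigned so far
    for v in order:
        if v in put_in_cover:
            # Step 2(i): v was promised to the cover earlier.
            label[v] = "IN"
            continue
        # Step 2(ii): look for an unlabeled, uncommitted neighbour w.
        partner = next(
            (w for w in adj[v] if w not in label and w not in put_in_cover),
            None,
        )
        if partner is not None:
            label[v] = "IN"
            put_in_cover.add(partner)   # the edge (v, partner) joins the matching
        else:
            # Step 2(iii): every neighbour is already labeled or committed.
            label[v] = "OUT"
    return label
```

On the path a, b, c presented in the order b, a, c, the sketch labels b and a IN and c OUT,
giving a cover of size 2 against an optimum of 1, which matches the 2k guarantee.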

The following result shows that this algorithm is optimal among Protocol 2 algorithms. 

Theorem 6.2.11 For any ε > 0, there is no Protocol 2 vertex cover algorithm that, for any
graph G, always outputs a vertex cover of size less than (2 - ε)k.

Proof: Given any such ε and algorithm A, we construct a graph G for which A outputs
a vertex cover of size at least (2 - ε)k. Let m be an integer such that m > 1/ε. Then G will
have either 2m or 3m vertices, depending on the labels that A assigns to the first m vertices
it sees. Thus by selecting m sufficiently large we can construct an arbitrarily large graph with
the desired property.

The first m vertices that A labels are $u_1, u_2, \ldots, u_m$ (in that order). As long as A doesn't
assign the label OUT to any of these vertices, for each $u_i$ A is told that $u_i$ is adjacent to
the vertices $v_1, v_2, \ldots, v_m$. There are thus two cases to consider: either A assigns each of
$u_1, u_2, \ldots, u_m$ the label IN, or else A assigns some $u_i$ the label OUT. We denote the size of the
vertex cover output by A by $k_A$.

Case 1: A assigns the label IN to each of $u_1, u_2, \ldots, u_m$. Each $u_i$ is adjacent to each of
$v_1, v_2, \ldots, v_m$, as mentioned above. The remainder of the graph G is then defined as follows.
We add m more vertices $w_1, w_2, \ldots, w_m$ to G, and for each i = 1, 2, ..., m, there is an edge
$(v_i, w_i)$. For each i, A must assign the label IN to at least one of $v_i$ and $w_i$. Since A has already
given the label IN to each of $u_1, u_2, \ldots, u_m$, $k_A$ is at least 2m. However, $\{v_1, v_2, \ldots, v_m\}$ is a
vertex cover for G, so k = m. Thus $k_A \ge 2k$.

Case 2: A assigns the label OUT to some vertex in $\{u_1, u_2, \ldots, u_m\}$. Let $l$ be such that
$u_{l+1}$ is the first vertex assigned the label OUT by A. Once $u_{l+1}$ is assigned the label OUT, no
more edges are added to G, and the definition of G is complete. (The vertices $u_{l+2}, \ldots, u_m$ are
in G but have degree zero.) Thus G is a bipartite graph, where the vertex set V is partitioned
into the two subsets $V_u = \{u_1, u_2, \ldots, u_m\}$ and $V_v = \{v_1, v_2, \ldots, v_m\}$. For each $i \le l + 1$,
there is an edge from $u_i$ to each vertex in $V_v$. In order to cover the edges incident to $u_{l+1}$, A
must assign the label IN to each vertex in $V_v$. Since A has already assigned the label IN to
each of $u_1, u_2, \ldots, u_l$, the size $k_A$ of the cover output by A is at least $m + l$. However, the set
$\{u_1, u_2, \ldots, u_{l+1}\}$ is a vertex cover for G of minimum size, so $k = l + 1$. Thus $k_A/k \ge (m+l)/(l+1)$. To
complete the analysis, we split this case into the following two subcases.

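Case 2-a: $m > l + 1$, so $m \ge l + 2$. Thus

$$\frac{k_A}{k} \ge \frac{m+l}{l+1} \ge \frac{2l+2}{l+1} = 2.$$

Case 2-b: $m = l + 1$. Thus

$$\frac{k_A}{k} \ge \frac{m+l}{l+1} = \frac{2l+1}{l+1} = 2 - \frac{1}{m} > 2 - \varepsilon.$$

Thus $k_A/k > 2 - \varepsilon$, so $k_A > (2 - \varepsilon)k$. □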
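
The case analysis in the proof of Theorem 6.2.11 can be checked numerically. The sketch below is
an illustration only: it encodes the bounds $k_A \ge 2m$, $k = m$ of Case 1 and $k_A \ge m + l$,
$k = l + 1$ of Case 2, and the function names are hypothetical.

```python
from fractions import Fraction

def ratio_lower_bound(m, first_out=None):
    """Lower bound on k_A / k forced by the adversary of the proof.

    first_out is None when A labels all of u_1..u_m IN (Case 1); otherwise
    it is l + 1, the 1-based index of the first u labeled OUT (Case 2).
    """
    if first_out is None:
        return Fraction(2 * m, m)       # k_A >= 2m and k = m
    l = first_out - 1
    return Fraction(m + l, l + 1)       # k_A >= m + l and k = l + 1

def bound_holds(eps):
    """Check that every case yields a ratio above 2 - eps when m > 1/eps."""
    m = int(1 / eps) + 1                # smallest convenient integer with m > 1/eps
    bounds = [ratio_lower_bound(m)]
    bounds += [ratio_lower_bound(m, j) for j in range(1, m + 1)]
    return all(b > 2 - eps for b in bounds)
```

The tightest case is $l = m - 1$, where the ratio is exactly $2 - 1/m$; this is why the proof
needs $m > 1/\varepsilon$.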

The proof of this lower bound does not hold for Protocol 3 algorithms. Thus it is possible
that there are algorithms operating under the third protocol that perform better than the
Protocol 2 algorithm given above. The following theorem gives a smaller lower bound on the
size of the vertex cover output by any Protocol 3 algorithm.

Theorem 6.2.12 There is no Protocol 3 vertex cover algorithm that, for any graph G, always
outputs a vertex cover of size less than $\frac{3}{2}k$.

Proof: Let A be any Protocol 3 vertex cover algorithm. We give a method to construct
a graph G, with arbitrarily large minimum-size vertex cover k and number of vertices n, such
that A outputs a cover of size at least $\frac{3}{2}k$. The graph G consists of a collection of connected
components, each with either 3 or 5 vertices.

Suppose A requests the adjacency list for some vertex $v_i$. If the connected component
containing $v_i$ has already been defined, then the adjacency list of $v_i$ is derived from the definition
of the connected component, and given as input to A. If the connected component containing
$v_i$ has not yet been defined, then A is told that $v_i$ is adjacent to the vertices $t_i$ and $u_i$ (neither
of which has appeared in any previous adjacency list shown to A). If A then assigns to $v_i$ the
label OUT, then the vertices $t_i$, $v_i$, and $u_i$ induce the entire connected component, which is thus
just the simple path of three vertices $(t_i, v_i, u_i)$. A must then assign both $t_i$ and $u_i$ the label IN,
so as to cover both edges in the component. Thus A uses two vertices to cover this component,
when it is possible to cover it with the single vertex $v_i$.

On the other hand, suppose that A assigns to $v_i$ the label IN. Then the connected component
is defined to include, in addition to $t_i$, $u_i$, and $v_i$, two new vertices $r_i$ and $s_i$ (neither of which
has appeared in any adjacency list already shown to A). In addition to the edges already defined,
the component also includes the edges $(r_i, t_i)$ and $(u_i, s_i)$. Thus the connected component is
the simple path of five vertices $(r_i, t_i, v_i, u_i, s_i)$, with $v_i$ the middle vertex on the path. In order
to cover the edges $(r_i, t_i)$ and $(u_i, s_i)$, A will have to assign the label IN to at least one of $r_i$ and
$t_i$ and at least one of $u_i$ and $s_i$. Since A has already given $v_i$ the label IN, it will use at least
three vertices to cover the connected component, although it is possible to do so with only two
vertices, $t_i$ and $u_i$.

This procedure can be iterated until a graph has been constructed with n and k as large as
desired. For each connected component, A uses at least $\frac{3}{2}$ times the number of vertices as are
necessary to cover the component. This proves the theorem. □
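
The per-component accounting behind Theorem 6.2.12 can be made explicit with a small Python
sketch. It is an illustration under stated assumptions: each entry of the input is the label A gives
to the first queried vertex $v_i$ of a component, and the function names are hypothetical.

```python
from fractions import Fraction

def component_costs(label_of_v):
    """Return (vertices A is forced to use, optimal cover size) for one component."""
    if label_of_v == "OUT":
        # Path of three vertices: A must take both endpoints, optimum is the middle.
        return 2, 1
    # Path of five vertices: A holds the middle plus at least two more,
    # while the optimum uses only the two neighbours of the middle vertex.
    return 3, 2

def forced_ratio(labels):
    """Lower bound on k_A / k over a sequence of components."""
    used = sum(component_costs(x)[0] for x in labels)
    best = sum(component_costs(x)[1] for x in labels)
    return Fraction(used, best)
```

Whatever mixture of labels A chooses, the ratio stays between 3/2 (all IN) and 2 (all OUT), so
it never drops below 3/2.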
6.2.3 The Online Dominating Set Problem

Finally, we consider the problem of finding a small dominating set of a graph.

Definition 6.2.13 For any graph G = (V, E), a subset V' of V is a dominating set of G if,
for every vertex v ∈ V, either v ∈ V' or else v is adjacent to some vertex in V'.

The dominating set problem is a vertex subset problem where the goal is to construct a
small subset V' of V, subject to the constraint that each vertex not in V' is adjacent to a vertex
in V'. As was the case with the vertex subset problems discussed earlier, the problem of finding
the minimum-size dominating set of a graph is NP-complete [28]. In what follows k is used to
denote the size of the minimum-size dominating set for a graph G.

Once again, it is easy to show that Protocol 1 algorithms can offer only extremely weak
performance guarantees. No Protocol 1 dominating set algorithm always outputs a dominating
set with less than (n - 1)k vertices. The proof of this is omitted.

The following theorem establishes a lower bound on the size of the dominating set guaranteed
to be output by any Protocol 3 algorithm. Since any Protocol 3 algorithm can easily be adjusted
to operate according to Protocol 2, the bound holds for Protocol 2 algorithms as well.

Theorem 6.2.14 There is no Protocol 3 dominating set algorithm that, for any graph G,
always outputs a dominating set of size less than $(2n)^{1/3}k$.

Proof: Let c > 2 be a positive integer. Given a Protocol 3 algorithm A and the value c we
define a graph G for which A outputs a dominating set of cardinality at least $(2n)^{1/3}k$, where
n is the number of vertices in G. We can define such a graph with an arbitrarily large number
of vertices by choosing c arbitrarily large. Alternatively, we can construct an arbitrarily large
graph with this property by using the technique described below to define a subgraph with
the property. The procedure can then be iterated as many times as necessary, using different
vertices each time, to produce a graph as large as desired.

Let $k_A$ denote the size of the dominating set output by A. We first define G and prove that
$k_A/k \ge c$. The graph G is defined according to the order in which A chooses to label the vertices
and the labels A assigns to those vertices. Let $v_1$ be the first vertex that A selects to label.
Define its adjacency list to be $U = \{u_1, u_2, \ldots, u_c\}$. If A assigns the label OUT to $v_1$, then the
definition of G is complete; it is the (c + 1)-vertex graph defined by the adjacency list of $v_1$. A
must assign the label IN to each vertex in U, so $k_A = c$. Since $\{v_1\}$ is a dominating set for G,
$k = 1$ and hence $k_A/k = c$.

If A gives $v_1$ the label IN, then we must describe how subsequent queries are handled, and
thus how G is defined. For i = 2, 3, ..., c - 1, if A has assigned the label IN to each of the first
i - 1 vertices it has queried, then the ith query is responded to as follows. Let $v_i$ denote the
vertex that A chooses as the ith vertex to label. The adjacency list for $v_i$ consists of $ic - i + 1$
new vertices and all vertices in U that have not previously been queried, as well as any other
vertices that must be included by virtue of $v_i$ having already appeared in their adjacency list
in response to an earlier query. (As in the proof of Theorem 6.2.7, not all of the vertices in this
proof will be given explicit names; a "new" vertex is one that A has not yet queried and that
has not appeared in any adjacency list already shown to A.) If A assigns $v_i$ the label IN then
go on to the next query. If A gives $v_i$ the label OUT, then the definition of G is complete, and
all responses to subsequent queries by A are, of course, based on this definition. In this case, A
must assign the label IN to each of the $ic - i + 1$ new vertices added in the ith query. Since A
has already given the first i - 1 vertices it queried the label IN, $k_A \ge (ic - i + 1) + (i - 1) = ic$.
However, the set containing the first i - 1 vertices queried by A and the vertex $v_i$ forms a
dominating set for G; hence $k \le i$, so $k_A/k \ge ic/i = c$.

We still must consider the case in which A assigns the label IN to each of the first c - 1
vertices that it queries. Suppose this is the case, and that thus far G contains the vertices and
edges defined in the responses to the first c - 1 queries of A. The remainder of G is defined as
follows. Let u be a vertex in U that has not yet been queried by A; at least one exists, since
U contains c vertices and only c - 1 queries have been made. Let w be a new vertex. Add an
edge from u to every unqueried vertex in G, including w. Since there is already an edge from u
to each vertex that has already been queried, u is adjacent to every other vertex in the graph.
Thus {u} is a dominating set for G, so k = 1. Since A has already assigned c - 1 vertices the
label IN, and must also assign that label to at least one of u and w, $k_A \ge c$. Hence $k_A/k \ge c$.

It remains only to express the lower bound on $k_A/k$ in terms of n and k. We first establish an
upper bound on n, the number of vertices in G. There are c + 1 vertices introduced in response
to the first query. For each i = 2, 3, ..., c - 1, at most $ic - i + 1$ new vertices are introduced in
the adjacency list given in response to the ith query made by A. In addition, for each of these
queries the queried vertex itself could be a new vertex. Finally, the vertex w may be added to
G. Thus an upper bound on the number of vertices in G is

$$\begin{aligned}
n &\le (c+1) + 1 + \sum_{i=2}^{c-1} (ic - i + 2) \\
  &= c + 2 + (c-1)\sum_{i=2}^{c-1} i + 2(c-2) \\
  &= \tfrac{1}{2}c^3 - c^2 + \tfrac{5}{2}c - 1.
\end{aligned}$$

Since $c > 2$, $c^2 - \tfrac{5}{2}c + 1 > 0$, so

$$\tfrac{1}{2}c^3 - c^2 + \tfrac{5}{2}c - 1 < \tfrac{1}{2}c^3.$$

Hence $n < \tfrac{1}{2}c^3$, so $c > (2n)^{1/3}$. Consequently $k_A/k > (2n)^{1/3}$; thus $k_A > (2n)^{1/3}k$. □
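
The arithmetic in this bound is easy to verify mechanically. The following Python sketch is an
illustration only (the function names are hypothetical): it compares the vertex count
$(c+1) + 1 + \sum_{i=2}^{c-1}(ic - i + 2)$ against the closed form $\tfrac{1}{2}c^3 - c^2 + \tfrac{5}{2}c - 1$
and against $\tfrac{1}{2}c^3$, working with doubled values so that everything stays in integers.

```python
def vertex_bound(c):
    """Upper bound on n from the proof: (c+1) + 1 + sum_{i=2}^{c-1} (ic - i + 2)."""
    return (c + 1) + 1 + sum(i * c - i + 2 for i in range(2, c))

def closed_form_times_two(c):
    """Twice the closed form c^3/2 - c^2 + (5/2)c - 1, kept in integers."""
    return c**3 - 2 * c**2 + 5 * c - 2

def bound_is_consistent(max_c):
    """Check the closed form and the comparison with c^3/2 for 2 <= c <= max_c."""
    return all(
        2 * vertex_bound(c) == closed_form_times_two(c)
        and 2 * vertex_bound(c) <= c**3
        for c in range(2, max_c + 1)
    )
```

For c = 3 the count is 11 vertices against $\tfrac{1}{2}c^3 = 13.5$; at c = 2 the two sides
coincide, and the inequality is strict for every larger c.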



6.2.4 Discussion

Interestingly, even though each of these three vertex subset problems is NP-complete, there is
a wide variation among the performance levels that can be achieved by Protocol 2 or 3 online
algorithms for these problems. For both the independent set problem and the dominating set
problem, no algorithm that operates according to either Protocol 2 or Protocol 3 will always
output a vertex subset of size bounded by a constant times the size of the optimal subset.
Even under Protocol 3, the best possible performance bounds for these problems differ from
the optimal solutions by factors on the order of $\sqrt{n}$ and $n^{1/3}$, respectively. In contrast, very good results
can be obtained under both Protocols 2 and 3 for the vertex cover problem. This suggests
that the performance of local heuristics for NP-complete graph problems is quite sensitive to
the particular problem under consideration, even though the offline decision versions of these
problems are of identical difficulty.

On the other hand, for each of these problems any Protocol 1 algorithm has a similar
(dismal) performance bound.

Note that most of the results in this chapter are negative, rather than positive, in nature.
Although online algorithms are allowed unlimited time and space, at each stage in its execution
only part of the graph is available to an online algorithm as input, so there is only a limited
amount of data for the algorithm to work with. Thus most known online
algorithms for problems of interest run quite efficiently. Consequently, positive results (i.e.
online algorithms with strong performance guarantees) for these problems might well imply the
existence of polynomial-time approximation algorithms for NP-complete problems with similar
strong performance guarantees. Such approximation algorithms have eluded researchers for a
long time; this suggests that finding such online algorithms is not an easy matter.

7 SUMMARY OF RESULTS



• If ft is a PAC-learnable representation class that is strongly polynomially closed under
exception lists then there exists a randomized polynomial-time (length-based) Occam
algorithm for ft. This result also holds in the case of learning one class in terms of
another class and for polynomial predictability.

• If ft is a PAC-learnable representation class that is polynomially closed under exception
lists then there exists a randomized polynomial-time (dimension-based) Occam algorithm
for ft.

• If F is a PAC-learnable family of Boolean formulas and FFt is polynomially predictable
then F is ss-learnable. Thus for any k ∈ N, the families of monomials, kCNF formulas,
kDNF formulas, and k-decision-lists are ss-learnable.

• If ft is a representation class that is polynomially learnable and such that ftftt is
predictable then ft is ss-learnable.

• If a representation class ft is polynomially learnable and there is a randomized
polynomial-time hypothesis finder for ftft then ft is ss-learnable. Thus the class of axis-aligned
rectangles in the Euclidean plane is ss-learnable.

• If the representation class ft is polynomially learnable from positive examples alone then
ft is ss-learnable.

• A family of Boolean formulas F is sc-learnable if and only if it is ss-learnable. A
representation class ft over an unparameterized domain is sc-learnable if and only if it is
ss-learnable.

• The DFA-predictable classes of languages are exactly the finite classes of regular
languages.

• The DPDA-predictable classes of languages are exactly the finite classes of deterministic
context-free languages.

• The 1CM-predictable classes of languages are exactly the finite classes of 1-counter
languages.

• There is a Protocol 1 online algorithm that always outputs a ft*-*^ 1 -bandwidth function
for any n-vertex graph with bandwidth k. No Protocol 1 algorithm always outputs a
jJyn-2-bandwidth function. There is no Protocol 2 online ^n-2-bandwidth algorithm.
For any ε > 0, no Protocol 3 algorithm always outputs a (2 - bandwidth function.

• There is no Protocol 2 algorithm that, for any n-vertex graph with an independent set of
size k, always outputs an independent set of size at least ^J^A. No Protocol 3 algorithm
always outputs an independent set of size at least

• There is a Protocol 2 online algorithm that, for any graph with a vertex cover of size
k, always outputs a vertex cover of size at most 2k. This is the best possible result for
Protocol 2 algorithms. No Protocol 3 algorithm always outputs a cover of size less than
$\frac{3}{2}k$.

• No Protocol 3 online algorithm always outputs a dominating set of size less than $(2n)^{1/3}k$
for any n-vertex graph with a dominating set of size k.